Async Batch Document Processing

A job is a named container of file submissions. Use POST /files/v1/jobs to submit up to 200,000 files in one call, then poll the job for completion and list its files.

| Endpoint | Description |
| --- | --- |
| POST /files/v1/jobs | Submit a batch of up to 200,000 documents in one call |
| GET /files/v1/jobs/{job_id} | Job status + counters |
| GET /files/v1/jobs/{job_id}/files | Paginated per-file listing, with optional status filter |
| GET /files/v1/jobs/{job_id}/files/{custom_id} | Fetch one file by your own custom_id |

POST /files/v1/jobs

POST api.mathpix.com/files/v1/jobs

Submit up to 200,000 documents for async processing in one call. Returns a job_id and file_count (number of items accepted). Partial success is reported via a rejected[] array — invalid items don't fail the whole call.

Example

curl -X POST https://api.mathpix.com/files/v1/jobs \
  -H 'app_key: APP_KEY' \
  -H 'Content-Type: application/json' \
  --data '{
    "job_id": "contracts-2026-05",
    "files": [
      { "source_uri": "s3://customer-bucket/docs/contract-1.pdf", "custom_id": "contract-1" },
      { "source_uri": "https://example.com/manual.pdf", "custom_id": "manual" }
    ],
    "conversion_formats": { "docx": true, "md": true }
  }'

Example response (all accepted)

{
  "job_id": "contracts-2026-05",
  "file_count": 2
}

Example response (partial success)

{
  "job_id": "contracts-2026-05",
  "file_count": 1,
  "rejected": [
    { "index": 1, "source_uri": "s3://other-bucket/doc.pdf", "custom_id": "manual", "reason": "data_source_not_found" }
  ]
}

Request parameters

In addition to the Files-API-specific parameters below, this endpoint accepts the same OCR and conversion options as POST /v3/pdf (alphabets_allowed, rm_spaces, math_inline_delimiters, include_smiles, etc.). Those options apply to every file in the submitted request. You can make multiple POST /files/v1/jobs requests to the same job_id with different per-request OCR options if you need to vary settings across subsets of a larger job.
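For input sets larger than the 200,000-item per-call cap, the multi-request pattern described above might be prepared as in the following sketch. The helper names `chunked` and `build_payloads` are hypothetical, and the HTTP call itself is left out so the block runs offline; each payload would be sent with POST /files/v1/jobs.

```python
from typing import Iterator

MAX_FILES_PER_CALL = 200_000  # per-call cap stated in this document


def chunked(items: list, size: int = MAX_FILES_PER_CALL) -> Iterator[list]:
    """Yield consecutive slices of `items`, each at most `size` long."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_payloads(job_id: str, files: list, options: dict,
                   size: int = MAX_FILES_PER_CALL) -> list:
    """Build one request body per chunk, all targeting the same job_id.

    `options` holds request-wide settings such as conversion_formats; they
    apply to every file in that request only.
    """
    return [{"job_id": job_id, "files": chunk, **options}
            for chunk in chunked(files, size)]
```

To vary OCR options across subsets, call `build_payloads` once per subset with the same `job_id` and a different `options` dict.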

files array

List of per-file submissions — each item describes one document. Item fields:

source_uri string

Source URI for this file. Same accepted schemes as POST /files/v1/uri: s3://, gs://, public https://, or Azure Blob HTTPS URL. Items can mix schemes within a job.

custom_id string (optional)

Per-item customer-supplied identifier. Max 256 chars, [A-Za-z0-9_\-.:], case-sensitive. Used for correlation in GET /files/v1/jobs/{job_id}/files and for idempotency. Supplying any custom_id requires an explicit top-level job_id (a server-generated one is not allowed) — see the job_id parameter below.
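A client-side pre-check can catch malformed identifiers before they are submitted. The sketch below encodes the character set and length limit stated above; the function name is hypothetical, and requiring at least one character is an assumption, since the minimum length is not stated here.

```python
import re

# Character set and 256-char limit as documented above. The lower bound of
# one character is an assumption of this sketch.
CUSTOM_ID_RE = re.compile(r"[A-Za-z0-9_\-.:]{1,256}")


def is_valid_custom_id(value: str) -> bool:
    """Return True if `value` fits the documented custom_id constraints."""
    return CUSTOM_ID_RE.fullmatch(value) is not None
```

Because matching is case-sensitive, `Doc-7` and `doc-7` both pass the check and are treated as different identifiers.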

filename string (optional)

Display name for this file, returned in the file listing and used to name outputs. Defaults to <file_id>.pdf when omitted.

destination_uri string (optional)

Per-file destination for this file's results. When set, outputs land at <destination_uri>/<destination_basename>.<ext> per requested format, and cropped images (when image_output_mode is "local") under <destination_uri>/images/. Give each file its own prefix so different files' outputs don't share one flat folder. Requires a registered data source for the bucket. Omit it to keep this file's results retrievable from Mathpix instead. There is no job-wide destination — set it per item.

s3_region string (optional)

Region of this file's destination_uri S3 bucket. Defaults to us-east-1.

destination_basename string (optional)

Basename for this file's output objects within its destination_uri — results land at <destination_uri>/<destination_basename>.<ext>. Defaults to the file_id.
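The output layout described for destination_uri and destination_basename can be sketched as plain string composition. Both helper names are hypothetical. Two assumptions are made here: a trailing slash on destination_uri is tolerated, and the extension equals the conversion_formats key (for example "docx" or "md").

```python
def output_uri(destination_uri: str, destination_basename: str, ext: str) -> str:
    """Compose <destination_uri>/<destination_basename>.<ext>."""
    # Assumption: strip a trailing slash so the result has exactly one separator.
    return f"{destination_uri.rstrip('/')}/{destination_basename}.{ext}"


def image_prefix(destination_uri: str) -> str:
    """Prefix for cropped images when image_output_mode is "local"."""
    return f"{destination_uri.rstrip('/')}/images/"
```

Giving each file its own destination_uri prefix, as recommended above, keeps each file's `images/` folder separate from the others.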

page_ranges string (optional)

Process only a subset of pages, e.g. "1-5,8" (same syntax as POST /v3/pdf). Omit to process the whole document.
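If the pages you want are held as a list of numbers, a page_ranges string in the form shown above can be generated rather than written by hand. This is a minimal sketch with a hypothetical helper name; it assumes 1-based page numbers and emits only the comma and hyphen syntax from the example, not any other forms POST /v3/pdf may accept.

```python
def to_page_ranges(pages) -> str:
    """Collapse page numbers into a string such as "1-5,8"."""
    ordered = sorted(set(pages))  # drop duplicates, ascending order
    if not ordered:
        raise ValueError("at least one page is required")
    if ordered[0] < 1:
        raise ValueError("page numbers are assumed to be 1-based")

    def fmt(start: int, end: int) -> str:
        return str(start) if start == end else f"{start}-{end}"

    parts = []
    start = prev = ordered[0]
    for page in ordered[1:]:
        if page == prev + 1:  # still inside a consecutive run
            prev = page
            continue
        parts.append(fmt(start, prev))  # close the finished run
        start = prev = page
    parts.append(fmt(start, prev))  # close the final run
    return ",".join(parts)
```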

conversion_formats object (optional)

Job-wide conversion formats; applied to every file in the batch. Same shape as POST /v3/pdf — e.g. { "docx": true, "md": true }. No per-item override in v1.

image_output_mode string (optional)

Job-wide. Set to "local" to write cropped images (figures, equation crops, etc.) into each file's destination_uri storage alongside the converted outputs, referenced by relative path from the Mathpix Markdown — the SCS-compatible layout. When unset (default), cropped images stay on Mathpix's CDN and are referenced by URL. Applies only to files that set a destination_uri. Same behavior as on POST /files/v1/uri.

job_id string (optional)

Caller-supplied job id. Optional in general — if you omit it the server generates one — but required whenever any item carries a custom_id (the (job_id, custom_id) idempotency key needs a job id you control; a server-generated one can't be replayed on retry). Also useful when you want predictable job ids for idempotent submission. See Idempotency.
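The rule that any custom_id requires an explicit job_id is easy to enforce before a request is sent. The builder below is a sketch with a hypothetical name; it covers only the fields described in this section and passes any other options through unchanged.

```python
from typing import Optional


def build_job_request(files: list, job_id: Optional[str] = None, **options) -> dict:
    """Assemble a POST /files/v1/jobs body, enforcing the custom_id rule."""
    if job_id is None and any("custom_id" in item for item in files):
        # A server-generated job id cannot be replayed on retry, so the
        # documented rule forbids custom_id without an explicit job_id.
        raise ValueError("custom_id requires an explicit job_id")
    body = {"files": files, **options}
    if job_id is not None:
        body["job_id"] = job_id
    return body
```

For example, `build_job_request(files, job_id="contracts-2026-05", conversion_formats={"md": True})` produces a body equivalent to the curl example earlier in this section.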

Response

{
  "job_id": "<uuid>",        // the job's id (use it for status and listing)
  "file_count": 2,           // number of items accepted into the job
  "rejected": [              // present only when any item failed; absent otherwise
    {
      "index": 7,            // position in your input array
      "source_uri": "s3://bucket/missing.pdf",
      "custom_id": "doc-7",
      "reason": "data_source_access_denied"  // one of the closed error codes
    }
  ]
}

rejected[] is omitted when every item was accepted — its presence is the signal that you should retry some items.
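Mapping rejected[] entries back to the submitted items might look like the sketch below. The function name is hypothetical. It treats `index` as a 0-based position in the input array, which matches the partial-success example earlier (the second submitted item is reported with index 1).

```python
def split_rejected(submitted: list, response: dict) -> tuple:
    """Return (accepted_count, items_to_resubmit, reason_by_index)."""
    rejected = response.get("rejected", [])  # key is absent when all items were accepted
    to_resubmit = [submitted[entry["index"]] for entry in rejected]
    reasons = {entry["index"]: entry["reason"] for entry in rejected}
    return response["file_count"], to_resubmit, reasons
```

Resubmitting is only useful once the cause named in `reason` has been addressed, for example by registering the missing data source.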

Failure modes

| Situation | Outcome |
| --- | --- |
| One or more items invalid in an otherwise-valid request | Partial success. 200, valid items accepted, invalid items in rejected[]. Resubmit just the rejected items. |
| A file's destination_uri is malformed, or its bucket has no registered data source | Per-file, not whole-job. The destination isn't validated synchronously at submit; the file is accepted, then reaches error status (counted in files_errored) when its result write fails. Other files are unaffected. |
| Item N's (job_id, custom_id) was already submitted | Idempotency hit: counts as a success with the original file_id; not in rejected[]. See Idempotency. |

Limits

See Limits and quotas for the full launch envelope and how the monthly page quota is shared with v3/pdf.

Idempotency

A submission is uniquely keyed by (job_id, custom_id). Resubmitting an item with the same pair returns the original file_id rather than creating a new submission. This makes retries safe.

  • Both parts are required to opt in: supply your own job_id and a per-item custom_id on the original call. Because custom_id requires an explicit job_id, idempotency is never available on a server-generated job id (you wouldn't know it to replay on retry).
  • If custom_id is absent, idempotency does not apply — each call creates a new submission.

Whole-batch idempotency via Idempotency-Key

If you don't supply your own job_id, you can still make the entire batch submission safe to retry by sending an Idempotency-Key request header (same constraints as custom_id: max 256 chars, [A-Za-z0-9_\-.:]). The server derives a deterministic job_id from your app_key + the key, so re-sending the same request — same key, same files — returns the original { "job_id", "file_count" } without re-enqueuing any file. Use this to safely retry a POST /files/v1/jobs call that timed out or whose response you never received.

  • The header is honored only when no job_id is supplied. If you send an explicit job_id, it wins and the header is ignored for job derivation.
  • This is batch-level dedup (the whole submission), distinct from the per-item (job_id, custom_id) dedup above. The two compose: within an idempotent batch, per-item custom_ids still dedup individual files.

For single-item submissions via POST /files/v1/uri without a job_id, the same Idempotency-Key header works at the single-file level.
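A retry wrapper built on the Idempotency-Key header could look like the following sketch. The transport is injected as a hypothetical `post(headers, payload)` callable so the retry logic stays independent of any HTTP library; which exceptions count as retryable is also an assumption, controlled by `retry_on`.

```python
import uuid


def submit_with_retry(post, payload: dict, app_key: str, attempts: int = 3,
                      key: str = None, retry_on: tuple = (TimeoutError,)):
    """Send the same batch with the same Idempotency-Key until one call succeeds."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    # The header is honored only when the payload carries no job_id.
    # A UUID string uses only hex digits and hyphens, within the allowed charset.
    key = key or str(uuid.uuid4())
    headers = {
        "app_key": app_key,
        "Content-Type": "application/json",
        "Idempotency-Key": key,  # identical on every attempt
    }
    last_error = None
    for _ in range(attempts):
        try:
            return post(headers, payload)
        except retry_on as exc:
            last_error = exc  # response never arrived; safe to resend
    raise last_error
```

With the `requests` library, `post` could wrap `requests.post(url, headers=headers, json=payload, timeout=...)`, with `retry_on=(requests.exceptions.Timeout,)`.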


GET /files/v1/jobs/{job_id}

GET api.mathpix.com/files/v1/jobs/{job_id}

Returns the job's status and counters. Poll this endpoint to detect job completion.

curl -H 'app_key: APP_KEY' \
  https://api.mathpix.com/files/v1/jobs/7e2a55d9-3a51-4d2c-9c8a-2c1f3e4f5d6b

// Response 200
{
  "job_id": "7e2a55d9-3a51-4d2c-9c8a-2c1f3e4f5d6b",
  "status": "processing",      // "processing" | "completed"
  "file_count": 200,           // total files accepted into this job
  "files_completed": 150,      // files that have produced final results
  "files_errored": 5,          // files in terminal error state
  "created_at": "2026-05-28T12:00:00Z",
  "modified_at": "2026-05-28T12:14:32Z"
}

Job is complete when status == "completed". Per-file failures don't fail the job — they're counted in files_errored. Use GET /files/v1/jobs/{job_id}/files?status=error to enumerate them.
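A polling loop for this endpoint might be structured as below. This is a sketch: `get_status` is a hypothetical callable that performs the GET and returns the parsed JSON body, and the interval and timeout defaults are arbitrary rather than recommended values.

```python
import time


def wait_for_job(get_status, interval: float = 10.0, timeout: float = 3600.0,
                 sleep=time.sleep, clock=time.monotonic) -> dict:
    """Poll until the job reports status "completed", then return that body."""
    deadline = clock() + timeout
    while True:
        body = get_status()
        if body["status"] == "completed":
            return body  # inspect files_errored for per-file failures
        if clock() >= deadline:
            raise TimeoutError("job did not complete before the timeout")
        sleep(interval)
```

A completed job can still contain failed files, so callers would typically check `files_errored` on the returned body before treating the batch as fully successful.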


GET /files/v1/jobs/{job_id}/files

GET api.mathpix.com/files/v1/jobs/{job_id}/files

Paginated listing of files in the job. Supports a status filter (e.g. for "show me only the errored ones").

Query parameters

| Param | Type | Description |
| --- | --- | --- |
| status | string | Optional. Filter to one of pending, split, completed, error. Omit to list all. |
| paging_state | string | Optional. Opaque pagination cursor from the previous response's next_page_token. |
| limit | int | Optional. Maximum items per page. |

Example

curl -H 'app_key: APP_KEY' \
  'https://api.mathpix.com/files/v1/jobs/7e2a55d9-3a51-4d2c-9c8a-2c1f3e4f5d6b/files?status=error'

// Response 200
{
  "files": [
    {
      "file_id": "f7d3a210-6c4e-49f3-bd5e-8e1c2f4d6b9a",
      "custom_id": "doc-7",
      "filename": "missing.pdf",
      "status": "error",
      "created_at": "2026-05-28T12:03:11Z"
    }
  ],
  "next_page_token": "..."     // absent when this is the last page
}

Iterate by passing the previous response's next_page_token back in as paging_state until the field is absent.

Failed-files lookup

The common pattern after a job completes:

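import requests

job_id = "contracts-2026-05"  # the job to inspect
files, page = [], None
while True:
    params = {"status": "error"}
    if page:
        params["paging_state"] = page  # cursor from the previous page
    r = requests.get(
        f"https://api.mathpix.com/files/v1/jobs/{job_id}/files",
        params=params,
        headers={"app_key": "APP_KEY"},
    )
    r.raise_for_status()  # surface HTTP errors instead of parsing an error body
    body = r.json()
    files.extend(body["files"])
    page = body.get("next_page_token")  # absent on the last page
    if not page:
        break
# .get() tolerates files that were submitted without a custom_id
print(f"{len(files)} files errored. custom_ids: {[f.get('custom_id') for f in files]}")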

Use the returned custom_ids to drive your retry logic against the original input set.


GET /files/v1/jobs/{job_id}/files/{custom_id}

GET api.mathpix.com/files/v1/jobs/{job_id}/files/{custom_id}

Fetch a single file by the (job_id, custom_id) you supplied at submit — no need to track our file_id. Returns the same body as GET /files/v1/{file_id}, at any status.

Example

curl -H 'app_key: APP_KEY' \
  'https://api.mathpix.com/files/v1/jobs/2026-05-invoices/files/contract-1'

// Response 200, same shape as GET /files/v1/{file_id}
{
  "file_id": "f7d3a210-6c4e-49f3-bd5e-8e1c2f4d6b9a",
  "custom_id": "contract-1",
  "status": "completed"
  // ...formats / conversion status, as in GET /files/v1/{file_id}
}

Both job_id and custom_id are required. An unknown (job_id, custom_id) — or one belonging to another account — returns 404 not_found (the two cases are indistinguishable by design). This lookup is available only when the original submission supplied your own job_id and a per-item custom_id (the same pair used for idempotency).
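Since custom_id may contain characters such as ":", building the lookup URL with percent-encoding is a reasonable precaution. The sketch below uses hypothetical helper names and assumes the server decodes percent-encoded path segments in the usual way; it also shows one way to treat the 404 case as "no such file" rather than as an exception.

```python
from typing import Optional
from urllib.parse import quote

BASE_URL = "https://api.mathpix.com/files/v1"


def file_lookup_url(job_id: str, custom_id: str) -> str:
    """Build the GET URL for one file, percent-encoding both path segments."""
    return (f"{BASE_URL}/jobs/{quote(job_id, safe='')}"
            f"/files/{quote(custom_id, safe='')}")


def interpret_lookup(status_code: int, body: dict) -> Optional[dict]:
    """Return the file body on 200 and None on 404 (unknown or not yours)."""
    if status_code == 404:
        return None
    if status_code != 200:
        raise RuntimeError(f"unexpected status {status_code}")
    return body
```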


See also