Async Batch Document Processing
A job is a named container of file submissions. Use POST /files/v1/jobs to submit up to 200,000 files in one call, then poll the job for completion and list its files.
| Endpoint | Description |
|---|---|
| POST /files/v1/jobs | Submit a batch of up to 200,000 documents in one call |
| GET /files/v1/jobs/{job_id} | Job status + counters |
| GET /files/v1/jobs/{job_id}/files | Paginated per-file listing, with optional status filter |
| GET /files/v1/jobs/{job_id}/files/{custom_id} | Fetch one file by your own custom_id |
POST /files/v1/jobs
POST api.mathpix.com/files/v1/jobs
Submit up to 200,000 documents for async processing in one call. Returns a job_id and file_count (number of items accepted). Partial success is reported via a rejected[] array — invalid items don't fail the whole call.
Example
curl -X POST https://api.mathpix.com/files/v1/jobs \
-H 'app_key: APP_KEY' \
-H 'Content-Type: application/json' \
--data '{
"job_id": "contracts-2026-05",
"files": [
{ "source_uri": "s3://customer-bucket/docs/contract-1.pdf", "custom_id": "contract-1" },
{ "source_uri": "https://example.com/manual.pdf", "custom_id": "manual" }
],
"conversion_formats": { "docx": true, "md": true }
}'
import requests
r = requests.post("https://api.mathpix.com/files/v1/jobs",
json={
"job_id": "contracts-2026-05",
"files": [
{"source_uri": "s3://customer-bucket/docs/contract-1.pdf", "custom_id": "contract-1"},
{"source_uri": "https://example.com/manual.pdf", "custom_id": "manual"},
],
"conversion_formats": {"docx": True, "md": True},
},
headers={"app_key": "APP_KEY", "Content-Type": "application/json"},
)
print(r.json()) # {"job_id": "...", "file_count": 2}
const response = await fetch("https://api.mathpix.com/files/v1/jobs", {
method: "POST",
headers: {
app_key: "APP_KEY",
"Content-Type": "application/json",
},
body: JSON.stringify({
job_id: "contracts-2026-05",
files: [
{ source_uri: "s3://customer-bucket/docs/contract-1.pdf", custom_id: "contract-1" },
{ source_uri: "https://example.com/manual.pdf", custom_id: "manual" },
],
conversion_formats: { docx: true, md: true },
}),
});
const { job_id, file_count, rejected } = await response.json();
console.log(`Job ${job_id} accepted ${file_count} files`);
{
"job_id": "contracts-2026-05",
"file_count": 2
}
{
"job_id": "contracts-2026-05",
"file_count": 1,
"rejected": [
{ "index": 1, "source_uri": "s3://other-bucket/doc.pdf", "custom_id": "manual", "reason": "data_source_not_found" }
]
}
Request parameters
In addition to the Files-API-specific parameters below, this endpoint accepts the same OCR and conversion options as POST /v3/pdf (alphabets_allowed, rm_spaces, math_inline_delimiters, include_smiles, etc.). Those options apply to every file in the submitted request. You can make multiple POST /files/v1/jobs requests to the same job_id with different per-request OCR options if you need to vary settings across subsets of a larger job.
files List of per-file submissions — each item describes one document. Item fields:
source_uri Source URI for this file. Same accepted schemes as POST /files/v1/uri: s3://, gs://, public https://, or Azure Blob HTTPS URL. Items can mix schemes within a job.
custom_id Per-item customer-supplied identifier. Max 256 chars, [A-Za-z0-9_\-.:], case-sensitive. Used for correlation in GET /files/v1/jobs/{job_id}/files and for idempotency. Supplying any custom_id requires an explicit top-level job_id (a server-generated one is not allowed) — see the job_id parameter below.
filename Display name for this file, returned in the file listing and used to name outputs. Defaults to <file_id>.pdf when omitted.
destination_uri Per-file destination for this file's results. When set, outputs land at <destination_uri>/<destination_basename>.<ext> per requested format, and cropped images (when image_output_mode is "local") under <destination_uri>/images/. Give each file its own prefix so different files' outputs don't share one flat folder. Requires a registered data source for the bucket. Omit it to keep this file's results retrievable from Mathpix instead. There is no job-wide destination — set it per item.
s3_region Region of this file's destination_uri S3 bucket. Defaults to us-east-1.
destination_basename Basename for this file's output objects within its destination_uri — results land at <destination_uri>/<destination_basename>.<ext>. Defaults to the file_id.
page_ranges Process only a subset of pages, e.g. "1-5,8" (same syntax as POST /v3/pdf). Omit to process the whole document.
conversion_formats Job-wide conversion formats; applied to every file in the batch. Same shape as POST /v3/pdf — e.g. { "docx": true, "md": true }. No per-item override in v1.
image_output_mode Job-wide. Set to "local" to write cropped images (figures, equation crops, etc.) into each file's destination_uri storage alongside the converted outputs, referenced by relative path from the Mathpix Markdown — the SCS-compatible layout. When unset (default), cropped images stay on Mathpix's CDN and are referenced by URL. Applies only to files that set a destination_uri. Same behavior as on POST /files/v1/uri.
job_id Caller-supplied job id. Optional in general — if you omit it the server generates one — but required whenever any item carries a custom_id (the (job_id, custom_id) idempotency key needs a job id you control; a server-generated one can't be replayed on retry). Also useful when you want predictable job ids for idempotent submission. See Idempotency.
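The custom_id and job_id rules above can be checked on the client before a request is sent. The following is a minimal sketch, not part of the Mathpix API: `validate_submission` and `CUSTOM_ID_RE` are hypothetical names, and rejecting an empty custom_id is an assumption. The server remains the authority on what it accepts.

```python
import re

# Documented custom_id rules: max 256 chars from [A-Za-z0-9_\-.:].
# Requiring at least one character is an assumption of this sketch.
CUSTOM_ID_RE = re.compile(r"[A-Za-z0-9_\-.:]{1,256}")


def validate_submission(payload):
    """Return a list of problems found in a job payload (empty if none).

    Hypothetical client-side helper; it checks only the custom_id format
    and the rule that any custom_id requires an explicit top-level job_id.
    """
    problems = []
    has_custom_id = False
    for index, item in enumerate(payload.get("files", [])):
        custom_id = item.get("custom_id")
        if custom_id is None:
            continue
        has_custom_id = True
        if not isinstance(custom_id, str) or not CUSTOM_ID_RE.fullmatch(custom_id):
            problems.append(f"files[{index}]: invalid custom_id {custom_id!r}")
    if has_custom_id and not payload.get("job_id"):
        problems.append("job_id is required when any item carries a custom_id")
    return problems
```

Calling it on the request body before POST /files/v1/jobs catches these two classes of mistake locally, so they never need to be resubmitted.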
Response
{
"job_id": "<uuid>", // the job's id (use it for status and listing)
"file_count": 2, // number of items accepted into the job
"rejected": [ // present only when any item failed; absent otherwise
{
"index": 7, // position in your input array
"source_uri": "s3://bucket/missing.pdf",
"custom_id": "doc-7",
"reason": "data_source_access_denied" // one of the closed error codes
}
]
}
rejected[] is omitted when every item was accepted — its presence is the signal that you should retry some items.
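As a rough illustration of acting on rejected[], the sketch below pairs each rejected entry with the item it refers to, using the documented index field. `rejected_items` is a hypothetical helper name, and the sketch assumes index counts from zero within the files array of the request that produced the response.

```python
def rejected_items(files, response_body):
    """Pair each rejected entry's reason with the original input item.

    `files` is the list sent in the request and `response_body` is the
    parsed JSON response. Returns an empty list when `rejected` is absent.
    """
    return [
        (files[entry["index"]], entry["reason"])
        for entry in response_body.get("rejected", [])
    ]


# Usage with the shapes shown above.
sent = [
    {"source_uri": "s3://customer-bucket/docs/contract-1.pdf", "custom_id": "contract-1"},
    {"source_uri": "s3://other-bucket/doc.pdf", "custom_id": "manual"},
]
received = {
    "job_id": "contracts-2026-05",
    "file_count": 1,
    "rejected": [
        {"index": 1, "source_uri": "s3://other-bucket/doc.pdf",
         "custom_id": "manual", "reason": "data_source_not_found"},
    ],
}
for item, reason in rejected_items(sent, received):
    print(item["custom_id"], reason)
```

Some reasons, such as a missing data source, likely need a configuration fix before the item is worth resubmitting, so grouping by reason first is usually more useful than a blind retry.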
Failure modes
| Situation | Outcome |
|---|---|
| One or more items invalid in an otherwise-valid request | Partial success. 200, valid items accepted, invalid items in rejected[]. Resubmit just the rejected items. |
| A file's destination_uri is malformed, or its bucket has no registered data source | Per-file, not whole-job. The destination isn't validated synchronously at submit; the file is accepted, then reaches error status (counted in files_errored) when its result write fails. Other files are unaffected. |
| Item N's (job_id, custom_id) was already submitted | Idempotency hit: counts as a success with the original file_id; not in rejected[]. See Idempotency. |
Limits
- Per-item input rules (file type, size) follow POST /files/v1/uri.
See Limits and quotas for the full launch envelope and how the monthly page quota is shared with v3/pdf.
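When an input set exceeds the 200,000-file cap for one call, one option is to split it across several calls that share a job_id, since multiple POST /files/v1/jobs requests to the same job_id are allowed. The sketch below only builds the request bodies. `chunked` and `build_payloads` are hypothetical helper names, and whether a job has an overall file cap is not stated here, so check Limits and quotas before relying on this.

```python
from itertools import islice

MAX_FILES_PER_CALL = 200_000  # per-call cap stated for POST /files/v1/jobs


def chunked(items, size=MAX_FILES_PER_CALL):
    """Yield successive lists of at most `size` items from any iterable."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def build_payloads(job_id, files, conversion_formats, size=MAX_FILES_PER_CALL):
    """Build one request body per chunk, all sharing the same job_id."""
    return [
        {"job_id": job_id, "files": batch, "conversion_formats": conversion_formats}
        for batch in chunked(files, size)
    ]
```

Each returned body can then be sent as its own POST. Supplying a custom_id per item keeps a retried chunk from creating duplicate submissions.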
Idempotency
A submission is uniquely keyed by (job_id, custom_id). Resubmitting an item with the same pair returns the original file_id rather than creating a new submission. This makes retries safe.
- Both parts are required to opt in: supply your own job_id and a per-item custom_id on the original call. Because custom_id requires an explicit job_id, idempotency is never available on a server-generated job id (you wouldn't know it to replay on retry).
- If custom_id is absent, idempotency does not apply: each call creates a new submission.
Whole-batch idempotency via Idempotency-Key
If you don't supply your own job_id, you can still make the entire batch submission safe to retry by sending an Idempotency-Key request header (same constraints as custom_id: max 256 chars, [A-Za-z0-9_\-.:]). The server derives a deterministic job_id from your app_key + the key, so re-sending the same request — same key, same files — returns the original { "job_id", "file_count" } without re-enqueuing any file. Use this to safely retry a POST /files/v1/jobs call that timed out or whose response you never received.
- The header is honored only when no job_id is supplied. If you send an explicit job_id, it wins and the header is ignored for job derivation.
- This is batch-level dedup (the whole submission), distinct from the per-item (job_id, custom_id) dedup above. The two compose: within an idempotent batch, per-item custom_ids still dedup individual files.
For single-item submissions via POST /files/v1/uri without a job_id, the same Idempotency-Key header works at the single-file level.
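One way to obtain a key that is identical on every retry of the same request is to hash the request body. This is a sketch of a client-side convention, not something the API prescribes: `idempotency_key` is a hypothetical helper, and any stable string within the documented constraints would serve equally well.

```python
import hashlib
import json


def idempotency_key(payload):
    """Derive a stable Idempotency-Key value from the request body."""
    # Sorting keys makes the serialization independent of dict ordering.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    # A SHA-256 hex digest is 64 characters drawn from 0-9 and a-f, which
    # fits the documented limit of 256 characters and the allowed set.
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# The body omits job_id, since the header is honored only in that case.
payload = {
    "files": [{"source_uri": "https://example.com/manual.pdf"}],
    "conversion_formats": {"md": True},
}
headers = {
    "app_key": "APP_KEY",
    "Content-Type": "application/json",
    "Idempotency-Key": idempotency_key(payload),
}
```

Because the key depends on the order of the files list, a retry has to resend the items in the same order to produce the same key.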
GET /files/v1/jobs/{job_id}
GET api.mathpix.com/files/v1/jobs/{job_id}
Returns the job's status and counters. Poll this endpoint to detect job completion.
curl -H 'app_key: APP_KEY' \
https://api.mathpix.com/files/v1/jobs/7e2a55d9-3a51-4d2c-9c8a-2c1f3e4f5d6b
// Response 200
{
"job_id": "7e2a55d9-3a51-4d2c-9c8a-2c1f3e4f5d6b",
"status": "processing", // "processing" | "completed"
"file_count": 200, // total files accepted into this job
"files_completed": 150, // files that have produced final results
"files_errored": 5, // files in terminal error state
"created_at": "2026-05-28T12:00:00Z",
"modified_at": "2026-05-28T12:14:32Z"
}
Job is complete when status == "completed". Per-file failures don't fail the job — they're counted in files_errored. Use GET /files/v1/jobs/{job_id}/files?status=error to enumerate them.
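A polling loop built on the completion rule above might look like the following sketch. `wait_for_job` is a hypothetical helper, and the poll interval and timeout are arbitrary illustrative values, since no recommended interval is given here. The HTTP call is passed in as `fetch_status` so the loop can be exercised without network access.

```python
import time


def wait_for_job(fetch_status, poll_seconds=10.0, timeout_seconds=3600.0,
                 sleep=time.sleep):
    """Poll until the job reports status "completed" or the timeout elapses.

    `fetch_status` is a caller-supplied function that returns the parsed
    JSON body of GET /files/v1/jobs/{job_id}.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        body = fetch_status()
        if body["status"] == "completed":
            return body
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"job still {body['status']!r} after {timeout_seconds} seconds"
            )
        sleep(poll_seconds)
```

With requests, `fetch_status` could be `lambda: requests.get(url, headers={"app_key": "APP_KEY"}).json()`, where `url` is the job's status URL. A returned body with a non-zero files_errored is the cue to list the errored files.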
GET /files/v1/jobs/{job_id}/files
GET api.mathpix.com/files/v1/jobs/{job_id}/files
Paginated listing of files in the job. Supports a status filter (e.g. for "show me only the errored ones").
Query parameters
| Param | Type | Description |
|---|---|---|
| status | string | Optional. Filter to one of pending, split, completed, error. Omit to list all. |
| paging_state | string | Optional. Opaque pagination cursor from the previous response's next_page_token. |
| limit | int | Optional. Maximum items per page. |
Example
curl -H 'app_key: APP_KEY' \
'https://api.mathpix.com/files/v1/jobs/7e2a55d9-3a51-4d2c-9c8a-2c1f3e4f5d6b/files?status=error'
// Response 200
{
"files": [
{
"file_id": "f7d3a210-6c4e-49f3-bd5e-8e1c2f4d6b9a",
"custom_id": "doc-7",
"filename": "missing.pdf",
"status": "error",
"created_at": "2026-05-28T12:03:11Z"
}
],
"next_page_token": "..." // absent when this is the last page
}
Iterate by passing the previous response's next_page_token back in as paging_state until the field is absent.
Failed-files lookup
The common pattern after a job completes:
import requests

job_id = "contracts-2026-05"  # the job to inspect
files, page = [], None
while True:
    params = {"status": "error"}
    if page:
        params["paging_state"] = page  # cursor from the previous page
    r = requests.get(f"https://api.mathpix.com/files/v1/jobs/{job_id}/files",
                     params=params, headers={"app_key": "APP_KEY"})
    r.raise_for_status()  # stop on HTTP errors instead of parsing an error body
    body = r.json()
    files.extend(body["files"])
    page = body.get("next_page_token")
    if not page:  # no token means this was the last page
        break
print(f"{len(files)} files errored. custom_ids: {[f.get('custom_id') for f in files]}")
Use the returned custom_ids to drive your retry logic against the original input set.
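Mapping the errored listing back to the original inputs can be sketched as below. `items_to_retry` is a hypothetical helper name. Because a (job_id, custom_id) pair that was already submitted counts as an idempotency hit, this sketch assumes the retry is sent under a different job_id. Whether an errored file can be reprocessed under its original pair is not stated here.

```python
def items_to_retry(original_files, errored_files):
    """Select the original input items whose custom_id appears in the errored listing."""
    errored_ids = {f["custom_id"] for f in errored_files if f.get("custom_id")}
    return [item for item in original_files if item.get("custom_id") in errored_ids]


# Usage with shapes matching the submit request and the file listing.
original = [
    {"source_uri": "s3://bucket/a.pdf", "custom_id": "doc-6"},
    {"source_uri": "s3://bucket/missing.pdf", "custom_id": "doc-7"},
    {"source_uri": "s3://bucket/c.pdf", "custom_id": "doc-8"},
]
errored = [{"file_id": "f7d3a210-6c4e-49f3-bd5e-8e1c2f4d6b9a",
            "custom_id": "doc-7", "status": "error"}]
retry_payload = {
    "job_id": "contracts-2026-05-retry",  # hypothetical job id for the retry
    "files": items_to_retry(original, errored),
}
```

Items submitted without a custom_id cannot be matched this way, so they are skipped.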
GET /files/v1/jobs/{job_id}/files/{custom_id}
GET api.mathpix.com/files/v1/jobs/{job_id}/files/{custom_id}
Fetch a single file by the (job_id, custom_id) you supplied at submit — no need to track our file_id. Returns the same body as GET /files/v1/{file_id}, at any status.
Example
curl -H 'app_key: APP_KEY' \
'https://api.mathpix.com/files/v1/jobs/2026-05-invoices/files/contract-1'
// Response 200 — same shape as GET /files/v1/{file_id}
{
"file_id": "f7d3a210-6c4e-49f3-bd5e-8e1c2f4d6b9a",
"custom_id": "contract-1",
"status": "completed"
// ...formats / conversion status, as in GET /files/v1/{file_id}
}
Both job_id and custom_id are required. An unknown (job_id, custom_id) — or one belonging to another account — returns 404 not_found (the two cases are indistinguishable by design). This lookup is available only when the original submission supplied your own job_id and a per-item custom_id (the same pair used for idempotency).
See also
- POST /files/v1/uri: single-document submission.
- Files API Quickstart: guide walkthrough including jobs.
- Data sources: register the buckets you'll source from.
- Limits and quotas: full launch limit envelope.
- Error handling: the closed Files API error code set.