Async Batch Document Processing

A job is a named container of file submissions. Use POST /files/v1/jobs to submit up to 200,000 files in one call, then poll the job for completion and list its files.

| Endpoint | Description |
| --- | --- |
| POST /files/v1/jobs | Submit a batch of up to 200,000 documents in one call |
| GET /files/v1/jobs | Paginated listing of your jobs, with optional date range |
| GET /files/v1/jobs/{job_id} | Job status + counters |
| GET /files/v1/jobs/{job_id}/files | Paginated per-file listing, with optional status filter |
| GET /files/v1/jobs/{job_id}/files/{custom_id} | Fetch one file by your own `custom_id` |

POST /files/v1/jobs

POST api.mathpix.com/files/v1/jobs

Submit up to 200,000 documents for async processing in one call. The request is accept-and-defer:

It returns a job_id and file_count immediately, then submits the items in the background.
file_count echoes the number of items you submitted.
Per-item failures (bad or unsupported source_uri, missing data source, duplicate custom_id) are not reported synchronously. Each surfaces as that file's error status when you poll the job (files_errored, and the ?status=error file listing).
An invalid item never fails the whole call.

Example

{
  "job_id": "contracts-2026-05",
  "files": [
    { "source_uri": "s3://customer-bucket/docs/contract-1.pdf", "custom_id": "contract-1" },
    { "source_uri": "https://example.com/manual.pdf", "custom_id": "manual" }
  ],
  "conversion_formats": { "docx": true, "md": true }
}

curl -X POST https://api.mathpix.com/files/v1/jobs \
-H 'app_key: APP_KEY' \
-H 'Content-Type: application/json' \
--data '{
  "job_id": "contracts-2026-05",
  "files": [
    { "source_uri": "s3://customer-bucket/docs/contract-1.pdf", "custom_id": "contract-1" },
    { "source_uri": "https://example.com/manual.pdf", "custom_id": "manual" }
  ],
  "conversion_formats": { "docx": true, "md": true }
}'

import requests
r = requests.post("https://api.mathpix.com/files/v1/jobs",
    json={
        "job_id": "contracts-2026-05",
        "files": [
            {"source_uri": "s3://customer-bucket/docs/contract-1.pdf", "custom_id": "contract-1"},
            {"source_uri": "https://example.com/manual.pdf", "custom_id": "manual"},
        ],
        "conversion_formats": {"docx": True, "md": True},
    },
    headers={"app_key": "APP_KEY", "Content-Type": "application/json"},
)
print(r.json())  # {"file_count": 2, "job_id": "contracts-2026-05"}

const response = await fetch("https://api.mathpix.com/files/v1/jobs", {
  method: "POST",
  headers: {
    app_key: "APP_KEY",
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    job_id: "contracts-2026-05",
    files: [
      { source_uri: "s3://customer-bucket/docs/contract-1.pdf", custom_id: "contract-1" },
      { source_uri: "https://example.com/manual.pdf", custom_id: "manual" },
    ],
    conversion_formats: { docx: true, md: true },
  }),
});
const { job_id, file_count } = await response.json();
console.log(`Job ${job_id} accepted ${file_count} files`);

body := bytes.NewBufferString(`{
  "job_id": "contracts-2026-05",
  "files": [
    {"source_uri": "s3://customer-bucket/docs/contract-1.pdf", "custom_id": "contract-1"},
    {"source_uri": "https://example.com/manual.pdf", "custom_id": "manual"}
  ],
  "conversion_formats": {"docx": true, "md": true}
}`)
req, _ := http.NewRequest("POST", "https://api.mathpix.com/files/v1/jobs", body)
req.Header.Set("app_key", "APP_KEY")
req.Header.Set("Content-Type", "application/json")
resp, err := http.DefaultClient.Do(req)
if err != nil {
	log.Fatal(err) // resp is nil when Do fails; needs the "log" import
}
defer resp.Body.Close()
result, _ := io.ReadAll(resp.Body)
fmt.Println(string(result)) // {"file_count": 2, "job_id": "contracts-2026-05"}

HttpClient client = HttpClient.newHttpClient();
String body = """
    {
      "job_id": "contracts-2026-05",
      "files": [
        {"source_uri": "s3://customer-bucket/docs/contract-1.pdf", "custom_id": "contract-1"},
        {"source_uri": "https://example.com/manual.pdf", "custom_id": "manual"}
      ],
      "conversion_formats": {"docx": true, "md": true}
    }
    """;
HttpRequest request = HttpRequest.newBuilder()
    .uri(URI.create("https://api.mathpix.com/files/v1/jobs"))
    .header("app_key", "APP_KEY")
    .header("Content-Type", "application/json")
    .POST(HttpRequest.BodyPublishers.ofString(body))
    .build();
HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
System.out.println(response.body());

Example response
{
  "file_count": 2,
  "job_id": "contracts-2026-05"
}

Request parameters

In addition to the Files-API-specific parameters below, this endpoint accepts the same OCR and conversion options as POST /v3/pdf (e.g. alphabets_allowed, rm_spaces, math_inline_delimiters, include_smiles). This includes metadata.improve_mathpix, which controls whether Mathpix persists your request data.

Those options apply to every file in the submitted request.
To vary settings across subsets of a larger job, make multiple POST /files/v1/jobs requests to the same job_id with different per-request OCR options.
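
As a rough sketch of that pattern, the helper below builds one request body per option set, all sharing a job_id. The name `build_requests` and the `(options, files)` grouping are illustrative assumptions, not part of the API; the option names are the ones mentioned above.

```python
from typing import Any


def build_requests(
    job_id: str,
    groups: list[tuple[dict[str, Any], list[dict[str, Any]]]],
) -> list[dict[str, Any]]:
    """Return one POST /files/v1/jobs body per (options, files) group."""
    bodies = []
    for options, files in groups:
        if "job_id" in options or "files" in options:
            raise ValueError("options must not override job_id or files")
        body = {"job_id": job_id, "files": files}
        body.update(options)  # request-level OCR options for this subset
        bodies.append(body)
    return bodies


bodies = build_requests("contracts-2026-05", [
    ({"rm_spaces": True}, [{"source_uri": "s3://customer-bucket/docs/contract-1.pdf"}]),
    ({"include_smiles": True}, [{"source_uri": "https://example.com/manual.pdf"}]),
])
```

Each body is then sent as its own POST /files/v1/jobs call.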

files FileSubmission[]

List of per-file submissions; each FileSubmission describes one document.

conversion_formats object (optional)

Job-wide conversion formats; applied to every file in the batch. Same shape as POST /v3/pdf, e.g. { "docx": true, "md": true }. No per-item override in v1.

image_output_mode string (optional)

Job-wide. Set to "local" to write cropped images (figures, equation crops, etc.) into each file's destination_uri storage alongside the converted outputs, referenced by relative path from the Mathpix Markdown. Applies only to files that set a destination_uri.

When unset (default), cropped images stay on Mathpix's CDN and are referenced by URL. Same behavior as on POST /files/v1/uri.

job_id string (optional)

Caller-supplied job id. If you omit it, the server generates one. It is required whenever any item carries a custom_id: the (job_id, custom_id) idempotency key needs a job id you control, because a server-generated one can't be replayed on retry. Supplying your own is also useful when you want predictable job ids. See Idempotency.

FileSubmission object

source_uri string

Source URI for this file. Same accepted schemes as POST /files/v1/uri: s3://, gs://, public https://, or Azure Blob HTTPS URL. Items can mix schemes within a job.

custom_id string (optional)

Per-item customer-supplied identifier. Max 256 chars, [A-Za-z0-9_\-.:], case-sensitive. Used for correlation in GET /files/v1/jobs/{job_id}/files and for idempotency.

Supplying any custom_id requires an explicit top-level job_id (a server-generated one is not allowed); see the job_id parameter above.
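
Because malformed or duplicated custom_ids only surface later as per-file errors, a client-side pre-check can save a round trip. The sketch below applies the length and character rules stated above; `check_custom_ids` is a hypothetical helper, and treating an empty string as malformed is an assumption.

```python
import re

# Max 256 chars from [A-Za-z0-9_\-.:]; the minimum of 1 is an assumption.
CUSTOM_ID_RE = re.compile(r"[A-Za-z0-9_\-.:]{1,256}")


def check_custom_ids(files: list[dict]) -> dict[str, str]:
    """Return {custom_id: reason} for ids likely to end in per-file errors."""
    problems: dict[str, str] = {}
    seen: set[str] = set()
    for item in files:
        cid = item.get("custom_id")
        if cid is None:
            continue  # custom_id is optional
        if not CUSTOM_ID_RE.fullmatch(cid):
            problems[cid] = "malformed"
        elif cid in seen:
            problems[cid] = "duplicate"
        seen.add(cid)
    return problems
```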

filename string (optional)

Display name for this file, returned in the file listing and used to name outputs. Defaults to <file_id>.pdf when omitted.

destination_uri string (optional)

Per-file destination for this file's results. Requires a registered data source for the bucket. Omit it to keep this file's results retrievable from Mathpix instead. There is no job-wide destination; set it per item.

Output file path convention: each requested format lands at <destination_uri>/<destination_basename>.<ext>, and cropped images (when image_output_mode is "local") under <destination_uri>/images/. Give each file its own prefix so different files' outputs don't share one flat folder.

note

The destination_uri folder must be short enough that the output object keys derived from it stay within your storage provider's object-key limit:

AWS S3 and Google Cloud Storage: 1024 bytes
Azure Blob Storage: 1024 UTF-16 characters and 254 path segments

A destination_uri that is too long puts the file in error status (error_id: destination_uri_too_long).
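
A client-side estimate of the output key length can catch this before submission. The sketch below follows the <destination_uri>/<destination_basename>.<ext> convention for S3 and GCS URIs only; the helper names are hypothetical, the image keys under images/ are not checked, and the server's own calculation may differ.

```python
from urllib.parse import urlparse

KEY_LIMIT_BYTES = 1024  # S3 / GCS object-key limit quoted above


def output_key(destination_uri: str, basename: str, ext: str) -> str:
    """Object key for one output: <destination_uri path>/<basename>.<ext>."""
    prefix = urlparse(destination_uri).path.strip("/")
    name = f"{basename}.{ext}"
    return f"{prefix}/{name}" if prefix else name


def fits_key_limit(destination_uri: str, basename: str, exts: list[str]) -> bool:
    """True when every requested output key stays within the byte limit."""
    return all(
        len(output_key(destination_uri, basename, ext).encode("utf-8")) <= KEY_LIMIT_BYTES
        for ext in exts
    )


print(output_key("s3://customer-bucket/out/contract-1", "contract-1", "docx"))
# out/contract-1/contract-1.docx
```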

s3_region string (optional)

Region of this file's destination_uri S3 bucket. Defaults to us-east-1.

destination_basename string (optional)

Basename for this file's output objects within its destination_uri; results land at <destination_uri>/<destination_basename>.<ext>. Defaults to the file_id.

If the resulting key would exceed the provider's object-key limit, your basename is replaced by a fixed fallback name. The folder still uniquely identifies the file, but to preserve your own naming, keep basenames short (or omit this field).

page_ranges string (optional)

Process only a subset of pages, e.g. "1-5,8" (same syntax as POST /v3/pdf). Omit to process the whole document.

Response body

file_count int

The number of items you submitted (optimistic; see the note below).

job_id string

The job's id; use it for status and listing.

note

The response returns before the items are processed, so there is no per-item rejected[]. The authoritative completed/errored counts come from the job status (file_count, files_completed, files_errored) once processing runs. Poll it, and use the ?status=error file listing to see exactly which items failed and why.

Failure modes

| Situation | Outcome |
| --- | --- |
| Whole request invalid (empty / malformed `files`, over the 200,000-item ceiling, or a `job_id` owned by another app) | Synchronous error. The call fails `400` / `403` before the job is accepted; nothing is enqueued. |
| An individual item is invalid (bad or unsupported `source_uri`, missing `source_uri`, malformed `custom_id`, a `custom_id` duplicated within the batch, or a `destination_uri` too long to keep output keys within the provider's object-key limit) | Per-file `error`. The request already returned `200`; the item is recorded as an errored file in the job (counted in `files_errored`, listed by `?status=error`), not dropped. Other items are unaffected. |
| A file's `destination_uri` is malformed, or its bucket has no registered data source | Per-file `error`. The file is accepted into the job, then reaches `error` status (counted in `files_errored`) when its source read or result write fails. Other files are unaffected. |
| Item N's `(job_id, custom_id)` was already submitted and that file is still live | Idempotency hit: counts as a success with the original `file_id`. See Idempotency. |

Limits

Per-item input rules (file type, size) follow POST /files/v1/uri.

See Limits and quotas for the full launch envelope and how the monthly page quota is shared with v3/pdf.
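
For input sets larger than the per-call ceiling, one approach is to split the list and send each slice as its own request to the same job_id, as the request-parameters section describes for varying options. A minimal sketch follows; `chunk_files` is a hypothetical helper, and whether a single job may hold more than 200,000 files in total is not stated here, so check Limits and quotas first.

```python
from typing import Iterator

MAX_ITEMS_PER_CALL = 200_000  # per-call ceiling stated above


def chunk_files(files: list[dict], size: int = MAX_ITEMS_PER_CALL) -> Iterator[list[dict]]:
    """Yield consecutive slices of at most `size` file submissions."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(files), size):
        yield files[start:start + size]
```

Each slice becomes the `files` array of one POST /files/v1/jobs body.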

Idempotency

A submission is uniquely keyed by (job_id, custom_id). Resubmitting an item with the same pair returns the original file_id rather than creating a new submission. This makes retries safe.

Both parts are required to opt in: supply your own job_id and a per-item custom_id on the original call. Because custom_id requires an explicit job_id, idempotency is never available on a server-generated job id (you wouldn't know it to replay on retry).
If custom_id is absent, idempotency does not apply; each call creates a new submission.
Only a live file (pending, split, or completed) counts as a hit. If the original file reached error status or was deleted, the same pair is a miss and resubmission creates a fresh file, so you can retry failures with the same ids.
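
These rules mean a timed-out submission can simply be resent unchanged, provided every item is keyed. The sketch below assumes a caller-provided `send` callable (hypothetical) that performs the POST and returns the parsed response; the attempt count and backoff delays are arbitrary choices.

```python
import time


def is_retry_safe(body: dict) -> bool:
    """True when every item can be deduplicated by (job_id, custom_id)."""
    files = body.get("files", [])
    return bool(body.get("job_id")) and all(f.get("custom_id") for f in files)


def submit_with_retry(send, body, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Resend the same body on connection failures, with exponential backoff."""
    if not is_retry_safe(body):
        raise ValueError("retrying needs a job_id and a custom_id on every item")
    for attempt in range(attempts):
        try:
            return send(body)
        except (ConnectionError, TimeoutError):
            if attempt == attempts - 1:
                raise  # out of attempts; let the caller decide
            sleep(base_delay * 2 ** attempt)
```

Items whose original file is still live come back as idempotency hits, so the retry does not create duplicates.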

Whole-batch idempotency via `Idempotency-Key`

If you don't supply your own job_id, you can still make the entire batch submission safe to retry by sending an Idempotency-Key request header (same constraints as custom_id: max 256 chars, [A-Za-z0-9_\-.:]).

The server derives a deterministic job_id from your app_key + the key.
Re-sending the same request (same key, same files) returns the original { "job_id", "file_count" } without re-enqueuing any file.
Use this to safely retry a POST /files/v1/jobs call that timed out or whose response you never received.
The header is honored only when no job_id is supplied. If you send an explicit job_id, it wins and the header is ignored for job derivation.
This is batch-level dedup (the whole submission), distinct from the per-item (job_id, custom_id) dedup above. The two compose: within an idempotent batch, per-item custom_ids still dedup individual files.
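
One way to get a stable key is to derive it from the request body itself, so an identical retry always carries the same header. This is a sketch under that assumption: `idempotency_key` and the `batch-` prefix are illustrative, and any value that meets the stated constraints works equally well.

```python
import hashlib
import json
import re

# Same constraints as custom_id: max 256 chars from [A-Za-z0-9_\-.:]
KEY_RE = re.compile(r"[A-Za-z0-9_\-.:]{1,256}")


def idempotency_key(body: dict) -> str:
    """Derive a deterministic key from the canonical JSON form of the body."""
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return f"batch-{digest}"  # 6 + 64 characters, all within the allowed set


body = {
    "files": [{"source_uri": "https://example.com/manual.pdf"}],
    "conversion_formats": {"md": True},
}
headers = {
    "app_key": "APP_KEY",
    "Content-Type": "application/json",
    "Idempotency-Key": idempotency_key(body),
}
```

Dictionary key order does not affect the result, but the order of items in `files` does, so retries should resend the list unchanged.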

For single-item submissions via POST /files/v1/uri without a job_id, the same Idempotency-Key header works at the single-file level.

GET /files/v1/jobs

GET api.mathpix.com/files/v1/jobs

Paginated listing of the jobs submitted under your account, newest first.

Use it to recover a job_id you no longer have, or to enumerate jobs in a date range.

Query parameters

start string (optional)

Earliest submission date to include, yyyy-MM-dd (UTC). Providing only one of start / end queries that single day.

end string (optional)

Latest submission date to include, yyyy-MM-dd (UTC).

limit int (optional)

Maximum jobs per page, 1–1000. Defaults to 100.

paging_state string (optional)

Opaque pagination cursor from the previous response's next_page_token.

Example

curl -H 'app_key: APP_KEY' \
  'https://api.mathpix.com/files/v1/jobs?start=2026-07-01&end=2026-07-31'

import requests
r = requests.get("https://api.mathpix.com/files/v1/jobs",
    params={"start": "2026-07-01", "end": "2026-07-31"},
    headers={"app_key": "APP_KEY"},
)
print(r.json())

const response = await fetch(
  "https://api.mathpix.com/files/v1/jobs?start=2026-07-01&end=2026-07-31",
  { headers: { app_key: "APP_KEY" } },
);
const { jobs, next_page_token } = await response.json();

req, _ := http.NewRequest("GET", "https://api.mathpix.com/files/v1/jobs?start=2026-07-01&end=2026-07-31", nil)
req.Header.Set("app_key", "APP_KEY")
resp, err := http.DefaultClient.Do(req)
if err != nil {
	log.Fatal(err) // resp is nil when Do fails; needs the "log" import
}
defer resp.Body.Close()
result, _ := io.ReadAll(resp.Body)
fmt.Println(string(result))

HttpClient client = HttpClient.newHttpClient();
HttpRequest request = HttpRequest.newBuilder()
    .uri(URI.create("https://api.mathpix.com/files/v1/jobs?start=2026-07-01&end=2026-07-31"))
    .header("app_key", "APP_KEY")
    .GET()
    .build();
HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
System.out.println(response.body());

Example response
{
  "jobs": [
    { "job_id": "contracts-2026-05", "created_at": "2026-07-21T14:12:18.650Z" },
    { "job_id": "my-batch-2026-07", "created_at": "2026-07-21T13:45:07.007Z" },
    { "job_id": "7e2a55d9-3a51-4d2c-9c8a-2c1f3e4f5d6b", "created_at": "2026-07-21T13:45:06.977Z" }
  ],
  "next_page_token": null
}

Response body

jobs JobSummary[]

Job entries, newest first.

next_page_token string (optional)

Non-null when more pages remain; pass it back as paging_state to get the next page.

JobSummary object

job_id string

The job's identifier: the one you supplied at submission, or server-generated. Fetch the job's status and counters with GET /files/v1/jobs/{job_id}.

created_at string

Submission time (ISO 8601, UTC).

note

A malformed date (use yyyy-MM-dd), an end before start, a limit outside 1–1000, or an invalid paging_state is rejected with 400 bad_request.
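
To walk every page of the listing, pass each next_page_token back as paging_state until it comes back null. The sketch below assumes a caller-provided `fetch_page` callable (hypothetical) that issues GET /files/v1/jobs with the given query parameters and returns the parsed JSON body.

```python
from typing import Callable, Iterator, Optional


def iter_jobs(
    fetch_page: Callable[[dict], dict],
    start: Optional[str] = None,
    end: Optional[str] = None,
    limit: int = 100,
) -> Iterator[dict]:
    """Yield every JobSummary, following next_page_token until it is null."""
    if not 1 <= limit <= 1000:
        raise ValueError("limit must be between 1 and 1000")
    params: dict = {"limit": limit}
    if start:
        params["start"] = start  # yyyy-MM-dd (UTC)
    if end:
        params["end"] = end
    while True:
        page = fetch_page(dict(params))  # copy so callers may keep the dict
        yield from page["jobs"]
        token = page.get("next_page_token")
        if not token:
            return
        params["paging_state"] = token
```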

GET /files/v1/jobs/{job_id}

GET api.mathpix.com/files/v1/jobs/{job_id}

Returns the job's status and counters.

Poll this endpoint to detect job completion.

Example

curl -H 'app_key: APP_KEY' \
  https://api.mathpix.com/files/v1/jobs/contracts-2026-05

import requests
r = requests.get("https://api.mathpix.com/files/v1/jobs/contracts-2026-05",
    headers={"app_key": "APP_KEY"},
)
print(r.json())

const response = await fetch("https://api.mathpix.com/files/v1/jobs/contracts-2026-05", {
  headers: { app_key: "APP_KEY" },
});
const job = await response.json();
console.log(job.status, job.files_completed, job.files_errored);

req, _ := http.NewRequest("GET", "https://api.mathpix.com/files/v1/jobs/contracts-2026-05", nil)
req.Header.Set("app_key", "APP_KEY")
resp, err := http.DefaultClient.Do(req)
if err != nil {
	log.Fatal(err) // resp is nil when Do fails; needs the "log" import
}
defer resp.Body.Close()
result, _ := io.ReadAll(resp.Body)
fmt.Println(string(result))

HttpClient client = HttpClient.newHttpClient();
HttpRequest request = HttpRequest.newBuilder()
    .uri(URI.create("https://api.mathpix.com/files/v1/jobs/contracts-2026-05"))
    .header("app_key", "APP_KEY")
    .GET()
    .build();
HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
System.out.println(response.body());

Example response
{
  "job_id": "contracts-2026-05",
  "status": "completed",
  "file_count": 2,
  "files_completed": 1,
  "files_errored": 1,
  "num_pages_sent": 30,
  "num_pages_completed": 18,
  "created_at": "2026-07-21T14:12:18.650Z",
  "modified_at": "2026-07-21T14:13:02.114Z"
}

Response body

job_id string

The job's identifier.

status string

"processing" while any file is still pending; "completed" when every file has reached a terminal state.

file_count int

Total files accepted into this job.

files_completed int

Files that have produced final results.

files_errored int

Files in terminal error state.

num_pages_sent int

Pages discovered so far across the job's documents. A document contributes its page count only once it has been opened and split for processing, so this value grows during processing and understates the job's total page count until every document has been split.

num_pages_completed int

Pages that have produced final results.

created_at string

Job creation time (ISO 8601, UTC).

modified_at string

Last state change (ISO 8601, UTC).

note

The job is complete when status == "completed". Per-file failures don't fail the job; they're counted in files_errored. Use GET /files/v1/jobs/{job_id}/files?status=error to enumerate them.
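
A polling loop for this endpoint might look like the sketch below. It assumes a caller-provided `get_status` callable (hypothetical) that issues the GET and returns the parsed job object; the intervals and timeout are arbitrary starting points, not documented recommendations.

```python
import time


def wait_for_job(get_status, interval=5.0, max_interval=60.0, timeout=3600.0,
                 sleep=time.sleep, clock=time.monotonic):
    """Poll until status == "completed", doubling the wait up to max_interval."""
    deadline = clock() + timeout
    while True:
        job = get_status()
        if job["status"] == "completed":
            return job  # inspect files_errored for per-file failures
        if clock() >= deadline:
            raise TimeoutError("job did not complete before the timeout")
        sleep(interval)
        interval = min(interval * 2, max_interval)
```

A completed job can still have a non-zero files_errored, so check that counter before treating the batch as fully successful.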

GET /files/v1/jobs/{job_id}/files

GET api.mathpix.com/files/v1/jobs/{job_id}/files

Paginated listing of files in the job.

Supports a status filter, for example to list only the errored files.

Query parameters

status string (optional)

Filter to one of pending, completed, error. Omit to list all. pending covers every file that has not reached a terminal state, including files currently being processed. Any other value returns 400 bad_request with the message status must be one of: pending, error, completed.

paging_state string (optional)

Opaque pagination cursor from the previous response's next_page_token.

limit int (optional)

Maximum items per page.

Example

curl -H 'app_key: APP_KEY' \
  'https://api.mathpix.com/files/v1/jobs/contracts-2026-05/files?status=error'

import requests
r = requests.get("https://api.mathpix.com/files/v1/jobs/contracts-2026-05/files",
    params={"status": "error"},
    headers={"app_key": "APP_KEY"},
)
print(r.json())

const response = await fetch(
  "https://api.mathpix.com/files/v1/jobs/contracts-2026-05/files?status=error",
  { headers: { app_key: "APP_KEY" } },
);
const { files, next_page_token } = await response.json();

req, _ := http.NewRequest("GET", "https://api.mathpix.com/files/v1/jobs/contracts-2026-05/files?status=error", nil)
req.Header.Set("app_key", "APP_KEY")
resp, err := http.DefaultClient.Do(req)
if err != nil {
	log.Fatal(err) // resp is nil when Do fails; needs the "log" import
}
defer resp.Body.Close()
result, _ := io.ReadAll(resp.Body)
fmt.Println(string(result))

HttpClient client = HttpClient.newHttpClient();
HttpRequest request = HttpRequest.newBuilder()
    .uri(URI.create("https://api.mathpix.com/files/v1/jobs/contracts-2026-05/files?status=error"))
    .header("app_key", "APP_KEY")
    .GET()
    .build();
HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
System.out.println(response.body());

Example response
{
  "files": [
    {
      "file_id": "f7d3a210-6c4e-49f3-bd5e-8e1c2f4d6b9a",
      "filename": "f7d3a210-6c4e-49f3-bd5e-8e1c2f4d6b9a.pdf",
      "status": "error",
      "custom_id": "manual"
    }
  ],
  "next_page_token": null
}

Response body

files JobFile[]

File entries.

next_page_token string (optional)

Non-null when more pages remain; pass it back as paging_state to get the next page.

JobFile object

file_id string

The file's Mathpix id; usable with GET /files/v1/{file_id}.

filename string

The display name supplied at submit, or <file_id>.pdf when none was.

status string

One of pending, completed, error. Populated when the status filter is used; null in the unfiltered listing.

custom_id string

The per-item identifier supplied at submit, or null.

Failed-files lookup

The common pattern after a job completes:

import requests

job_id = "contracts-2026-05"  # the job to inspect
files, page = [], None
while True:
    params = {"status": "error"}
    if page:
        params["paging_state"] = page  # cursor from the previous page
    r = requests.get(f"https://api.mathpix.com/files/v1/jobs/{job_id}/files",
                     params=params, headers={"app_key": "APP_KEY"})
    r.raise_for_status()  # fail loudly on HTTP errors instead of parsing an error body
    body = r.json()
    files.extend(body["files"])
    page = body.get("next_page_token")
    if not page:
        break
print(f"{len(files)} files errored. custom_ids: {[f['custom_id'] for f in files]}")

Use the returned custom_ids to drive your retry logic against the original input set.
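
Since an errored file no longer counts as an idempotency hit, the failed items can be resubmitted under the same ids. The sketch below selects them from the original input; `build_retry_body` is a hypothetical helper, and any request-level options from the first call would need to be added to the body again.

```python
def build_retry_body(job_id: str, original_files: list[dict], errored: list[dict]) -> dict:
    """Build a POST /files/v1/jobs body containing only the failed items."""
    failed_ids = {f["custom_id"] for f in errored if f.get("custom_id")}
    retry_files = [f for f in original_files if f.get("custom_id") in failed_ids]
    return {"job_id": job_id, "files": retry_files}


original = [
    {"source_uri": "s3://customer-bucket/docs/contract-1.pdf", "custom_id": "contract-1"},
    {"source_uri": "https://example.com/manual.pdf", "custom_id": "manual"},
]
errored = [{"file_id": "f7d3a210-6c4e-49f3-bd5e-8e1c2f4d6b9a", "status": "error", "custom_id": "manual"}]
retry_body = build_retry_body("contracts-2026-05", original, errored)
```

Errored files submitted without a custom_id cannot be matched this way and are skipped.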

GET /files/v1/jobs/{job_id}/files/{custom_id}

GET api.mathpix.com/files/v1/jobs/{job_id}/files/{custom_id}

Fetch a single file by the (job_id, custom_id) you supplied at submit; there is no need to track the Mathpix file_id.

Returns the same body as GET /files/v1/{file_id}, at any status.

Example

curl -H 'app_key: APP_KEY' \
  'https://api.mathpix.com/files/v1/jobs/contracts-2026-05/files/contract-1'

import requests
r = requests.get("https://api.mathpix.com/files/v1/jobs/contracts-2026-05/files/contract-1",
    headers={"app_key": "APP_KEY"},
)
print(r.json())

const response = await fetch(
  "https://api.mathpix.com/files/v1/jobs/contracts-2026-05/files/contract-1",
  { headers: { app_key: "APP_KEY" } },
);
const file = await response.json();
console.log(file.file_id, file.status);

req, _ := http.NewRequest("GET", "https://api.mathpix.com/files/v1/jobs/contracts-2026-05/files/contract-1", nil)
req.Header.Set("app_key", "APP_KEY")
resp, err := http.DefaultClient.Do(req)
if err != nil {
	log.Fatal(err) // resp is nil when Do fails; needs the "log" import
}
defer resp.Body.Close()
result, _ := io.ReadAll(resp.Body)
fmt.Println(string(result))

HttpClient client = HttpClient.newHttpClient();
HttpRequest request = HttpRequest.newBuilder()
    .uri(URI.create("https://api.mathpix.com/files/v1/jobs/contracts-2026-05/files/contract-1"))
    .header("app_key", "APP_KEY")
    .GET()
    .build();
HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
System.out.println(response.body());

Example response
{
  "percent_done": 100.0,
  "formats": {},
  "custom_id": "contract-1",
  "num_pages": 30,
  "destination_uri": null,
  "destination_basename": null,
  "filename": "2e60c261-8a0d-4425-9f2c-7d1e4b6a5c38.pdf",
  "format_primary": "mmd",
  "file_id": "2e60c261-8a0d-4425-9f2c-7d1e4b6a5c38",
  "num_pages_completed": 30,
  "status": "completed"
}

The body is the file-status object of GET /files/v1/{file_id}; see that reference for the field list.

note

Both job_id and custom_id are required. An unknown (job_id, custom_id), or one belonging to another account, returns 404 not_found (the two cases are indistinguishable by design). This lookup is available only when the original submission supplied your own job_id and a per-item custom_id (the same pair used for idempotency).
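
A thin wrapper can turn the 404 case into a plain "not found" result. In the sketch below, `get` is a hypothetical callable that performs the request and returns `(status_code, parsed_body)`; percent-encoding the path segments is a precaution that assumes the server decodes them in the usual way.

```python
from urllib.parse import quote

BASE_URL = "https://api.mathpix.com"


def file_lookup_url(job_id: str, custom_id: str) -> str:
    """Build the lookup URL, percent-encoding both path segments."""
    return f"{BASE_URL}/files/v1/jobs/{quote(job_id, safe='')}/files/{quote(custom_id, safe='')}"


def find_file(get, job_id: str, custom_id: str):
    """Return the file-status object, or None when the pair is unknown (404)."""
    status_code, body = get(file_lookup_url(job_id, custom_id))
    if status_code == 404:
        return None  # unknown pair, or one owned by another account
    if status_code != 200:
        raise RuntimeError(f"unexpected status {status_code}")
    return body
```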
