Process a PDF

Submit a PDF (or EPUB, DOCX, PPTX, and other document formats) for OCR processing. Results are available as Mathpix Markdown, DOCX, LaTeX, HTML, and more.

Info: PDF processing is asynchronous. You submit the document, then poll for status and download results when complete. For real-time partial results, use the streaming option.

Submit via URL

Submit a document URL to the v3/pdf endpoint:

{
  "url": "https://cdn.mathpix.com/examples/cs229-notes1.pdf",
  "conversion_formats": { "docx": true, "tex.zip": true }
}

Example response:

{
  "pdf_id": "2024_01_15_abc123def456"
}

Use the pdf_id to poll processing status, download results, and delete results.
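
As a rough illustration of how the JSON body above might be sent without the SDK, here is a minimal sketch using only the Python standard library. The endpoint URL and the app_id / app_key headers are taken from the curl examples on this page; the helper names build_submit_request and submit are hypothetical.

```python
import json
import urllib.request

API_URL = "https://api.mathpix.com/v3/pdf"


def build_submit_request(app_id, app_key, document_url, conversion_formats=None):
    """Build (but do not send) the POST request that submits a document URL."""
    body = {"url": document_url}
    if conversion_formats:
        body["conversion_formats"] = conversion_formats
    return urllib.request.Request(
        API_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "app_id": app_id,
            "app_key": app_key,
            "Content-Type": "application/json",
        },
        method="POST",
    )


def submit(request):
    """Send the request and return the pdf_id from the JSON response."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)["pdf_id"]
```

Keeping request construction separate from sending makes the payload easy to inspect or log before any network call is made.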

Submit via file upload

Upload a document file to the v3/pdf endpoint via multipart form-data:

from mpxpy.mathpix_client import MathpixClient

client = MathpixClient(app_id="APP_ID", app_key="APP_KEY")
pdf = client.pdf_new(
    file_path="document.pdf",
    convert_to_docx=True,
    convert_to_tex_zip=True,
)
print(pdf.pdf_id)

Example response:

{
  "pdf_id": "2024_01_15_abc123def456"
}

Use the pdf_id to poll processing status, download results, and delete results.

Poll processing status

After submitting, poll GET v3/pdf/{pdf_id} until status is "completed":

# wait_until_complete handles polling automatically
pdf.wait_until_complete(timeout=60)
print(pdf.pdf_status())

Example response while processing:

{
  "status": "split",
  "num_pages": 12,
  "num_pages_completed": 4,
  "percent_done": 33.33
}

Example response when complete:

{
  "status": "completed",
  "num_pages": 12,
  "num_pages_completed": 12,
  "percent_done": 100
}
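
If you poll the endpoint yourself rather than through the SDK, the loop might look like the sketch below. The name wait_for_completion is hypothetical, and fetch_status stands for any function of yours that returns the parsed JSON status body. Treating "error" as a terminal status is an assumption; the examples above only show "split" and "completed".

```python
import time


def wait_for_completion(fetch_status, timeout=60.0, interval=2.0, sleep=time.sleep):
    """Poll fetch_status() until it reports "completed" or the timeout elapses.

    fetch_status is any zero-argument callable returning the parsed JSON
    status body, for example a wrapper around GET v3/pdf/{pdf_id}.
    """
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status()
        if status.get("status") == "completed":
            return status
        # Assumption: an "error" status is terminal, so polling should stop.
        if status.get("status") == "error":
            raise RuntimeError(f"processing failed: {status}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"not completed after {timeout} s: {status}")
        sleep(interval)
```

Passing sleep as a parameter lets the loop be exercised with canned responses and no real waiting.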

Download results

Once processing is complete, download each result with GET v3/pdf/{pdf_id}.{ext}, where {ext} is the extension of the format you want (for example, docx or tex.zip):

# Save to files
pdf.to_md_file(path="result.mmd")
pdf.to_docx_file(path="result.docx")
pdf.to_tex_zip_file(path="result.tex.zip")
pdf.to_lines_json_file(path="lines.json")

# Or get content in memory
md_text = pdf.to_md_text()  # str
docx_bytes = pdf.to_docx_bytes()  # bytes
lines = pdf.to_lines_json()  # dict
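
For reference, the URL pattern described above can also be built by hand. The following is a minimal standard-library sketch; result_url and download_result are hypothetical names, and the extensions you pass (such as "docx" or "tex.zip") are assumed to match the format names used elsewhere on this page.

```python
import urllib.request

BASE_URL = "https://api.mathpix.com/v3/pdf"


def result_url(pdf_id, ext):
    """Return the download URL for one output format of a processed PDF."""
    return f"{BASE_URL}/{pdf_id}.{ext.lstrip('.')}"


def download_result(app_id, app_key, pdf_id, ext, path):
    """Fetch one result file and write it to path as raw bytes."""
    request = urllib.request.Request(
        result_url(pdf_id, ext),
        headers={"app_id": app_id, "app_key": app_key},
    )
    with urllib.request.urlopen(request) as response, open(path, "wb") as out:
        out.write(response.read())
    return path
```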

Check conversion status

If you requested conversion_formats, check their status separately:

print(pdf.pdf_conversion_status())

Example conversion status response:

{
  "status": "completed",
  "conversion_status": {
    "docx": { "status": "completed" },
    "tex.zip": { "status": "completed" }
  }
}
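
One way to act on a response shaped like the example above is to list the formats that are not finished yet. This is a small sketch that assumes only the structure shown in that example; pending_conversions is a hypothetical helper name.

```python
def pending_conversions(status_body, requested):
    """Return the requested formats whose conversion is not yet "completed"."""
    per_format = status_body.get("conversion_status", {})
    return [
        fmt
        for fmt in requested
        if per_format.get(fmt, {}).get("status") != "completed"
    ]
```

An empty list means every requested format is ready to download. A format missing from the response is treated as still pending.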

Stream pages

For lower time-to-first-data, enable the streaming request parameter and connect to GET v3/pdf/{pdf_id}/stream to receive page results via server-sent events (SSE) as each page completes:

# 1. Submit with streaming enabled
curl -X POST https://api.mathpix.com/v3/pdf \
  -H 'app_id: APP_ID' \
  -H 'app_key: APP_KEY' \
  -H 'Content-Type: application/json' \
  --data '{"url": "https://cdn.mathpix.com/examples/cs229-notes1.pdf", "streaming": true}'

# 2. Connect to the SSE stream
curl https://api.mathpix.com/v3/pdf/PDF_ID/stream \
  -H 'app_id: APP_ID' \
  -H 'app_key: APP_KEY'

Pages are streamed one JSON object at a time. Pages are not guaranteed to be in order, although they generally will be.
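
Because pages can arrive out of order, a client may want to collect them and sort before assembling the document. The sketch below is illustrative only and rests on two assumptions not confirmed on this page: that each page object carries a numeric page index (called page_idx here) and that lines may or may not carry the SSE "data:" prefix. The helper name collect_pages is hypothetical.

```python
import json


def collect_pages(lines, order_key="page_idx"):
    """Parse streamed lines into page objects and return them in page order.

    lines is any iterable of str or bytes, such as an open HTTP response.
    order_key is an ASSUMED field name for the page index.
    """
    pages = []
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        # Tolerate SSE framing: strip an optional "data:" prefix.
        if line.startswith("data:"):
            line = line[len("data:"):].strip()
        # Skip blank lines, SSE comments, and other non-JSON fields.
        if not line.startswith("{"):
            continue
        pages.append(json.loads(line))
    return sorted(pages, key=lambda page: page.get(order_key, 0))
```

Sorting after collection gives up incremental display. If you need both, insert each page into its slot as it arrives instead.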

Process specific pages

Use the page_ranges request parameter to process only certain pages.

{
  "url": "https://cdn.mathpix.com/examples/cs229-notes1.pdf",
  "page_ranges": "2,4-6"
}

Example response:

{
  "pdf_id": "2024_01_15_abc123def456"
}

The value "2,4-6" selects pages [2, 4, 5, 6]. You can also use negative indices: "2 - -2" selects all pages from the second to the next-to-last.
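
To make the range syntax concrete, here is a client-side sketch that expands a page_ranges string into explicit page numbers. It illustrates the rules described above and is not the server's parser. It assumes pages are numbered from 1 and that -1 refers to the last page, which is consistent with -2 being the next-to-last. The name expand_page_ranges is hypothetical.

```python
import re

# One comma-separated part: a single page, or "start-end" with optional
# spaces around the dash; either number may be negative.
_RANGE = re.compile(r"^\s*(-?\d+)\s*(?:-\s*(-?\d+))?\s*$")


def expand_page_ranges(spec, num_pages):
    """Expand a page_ranges string into a list of 1-based page numbers."""

    def resolve(token):
        index = int(token)
        # Assumption: -1 is the last page, -2 the next-to-last, and so on.
        return num_pages + index + 1 if index < 0 else index

    pages = []
    for part in spec.split(","):
        match = _RANGE.match(part)
        if match is None:
            raise ValueError(f"invalid page range: {part!r}")
        start = resolve(match.group(1))
        end = resolve(match.group(2)) if match.group(2) is not None else start
        pages.extend(range(start, end + 1))
    return pages
```

For a 6-page document, expand_page_ranges("2 - -2", 6) gives pages 2 through 5.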

Delete results

Permanently delete a PDF's output data via DELETE v3/pdf/{pdf_id} when you no longer need it:

curl -X DELETE https://api.mathpix.com/v3/pdf/PDF_ID \
  -H 'app_id: APP_ID' \
  -H 'app_key: APP_KEY'

Warning: Download and store files locally before deleting if you need to keep them. Deletion is permanent.

Note: PDF page images and cropped images (figures, diagrams) served via CDN may remain accessible for up to 24 hours after deletion while cached copies expire.
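
Since deletion is permanent, it can help to enforce the "download first" order in code. The sketch below shows one possible arrangement. archive_then_delete is a hypothetical helper, and download and delete are callables you supply (for example, thin wrappers around the GET and DELETE requests shown on this page).

```python
def archive_then_delete(pdf_id, formats, download, delete):
    """Download every requested format, then delete the remote results.

    download(pdf_id, ext) must return the saved path and raise on failure.
    delete(pdf_id) is only reached if every download succeeded.
    """
    saved = [download(pdf_id, ext) for ext in formats]
    delete(pdf_id)
    return saved
```

If any download raises, the exception propagates before delete is called, so the remote copy is left intact.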

Supported formats

See Supported Formats for the full list of accepted input and output formats.

Next steps