Process a PDF

Submit a PDF (or EPUB, DOCX, PPTX, and other document formats) for OCR processing. Results are available as Mathpix Markdown, DOCX, LaTeX, HTML, and more.

Note: PDF processing is asynchronous. You submit the document, then poll for status and download results when complete. For real-time partial results, use the streaming option.

Submit via URL

from mpxpy.mathpix_client import MathpixClient
client = MathpixClient(app_id="APP_ID", app_key="APP_KEY")
pdf = client.pdf_new(
    url="https://cdn.mathpix.com/examples/cs229-notes1.pdf",
    convert_to_docx=True,
    convert_to_tex_zip=True,
)
print(pdf.pdf_id)

Submit via file upload

from mpxpy.mathpix_client import MathpixClient
client = MathpixClient(app_id="APP_ID", app_key="APP_KEY")
pdf = client.pdf_new(
    file_path="document.pdf",
    convert_to_docx=True,
    convert_to_tex_zip=True,
)
print(pdf.pdf_id)

Poll processing status

After submitting, poll until status is "completed":

# wait_until_complete handles polling automatically
pdf.wait_until_complete(timeout=60)
print(pdf.pdf_status())

Response while processing:

{
  "status": "split",
  "num_pages": 12,
  "num_pages_completed": 4,
  "percent_done": 33.33
}

Response when complete:

{
  "status": "completed",
  "num_pages": 12,
  "num_pages_completed": 12,
  "percent_done": 100
}
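
If you handle status payloads yourself, a small helper can turn them into a progress line and a completion check. This is a minimal sketch, assuming the payload is available as a Python dict shaped like the two responses above. The function names `summarize_status` and `is_finished` are hypothetical and not part of the Mathpix client.

```python
def summarize_status(status: dict) -> str:
    """Return a one-line progress summary for a status payload."""
    state = status.get("status", "unknown")
    done = status.get("num_pages_completed", 0)
    total = status.get("num_pages", 0)
    percent = status.get("percent_done", 0)
    return f"{state}: {done}/{total} pages ({percent:.0f}%)"


def is_finished(status: dict) -> bool:
    """True once the payload reports the "completed" status."""
    return status.get("status") == "completed"


# The two example payloads shown above.
processing = {"status": "split", "num_pages": 12,
              "num_pages_completed": 4, "percent_done": 33.33}
complete = {"status": "completed", "num_pages": 12,
            "num_pages_completed": 12, "percent_done": 100}

print(summarize_status(processing))
print(is_finished(complete))
```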

Download results

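Once processing is complete, download the results in the formats you need. Over the REST API, you do this by appending the format extension to the PDF's URL; the Python client provides one method per format: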

# Save to files
pdf.to_md_file(path="result.mmd")
pdf.to_docx_file(path="result.docx")
pdf.to_tex_zip_file(path="result.tex.zip")
pdf.to_lines_json_file(path="lines.json")
# Or get content in memory
md_text = pdf.to_md_text() # str
docx_bytes = pdf.to_docx_bytes() # bytes
lines = pdf.to_lines_json() # dict
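
For callers who use the REST API directly, the "append the format extension" rule can be sketched as a URL builder. This is an illustration only: the base URL is taken from the cURL examples on this page, `result_url` is a hypothetical helper, and the extensions used here (`mmd`, `docx`, `tex.zip`) are assumed from the file names and conversion keys that appear elsewhere on this page.

```python
BASE_URL = "https://api.mathpix.com/v3/pdf"


def result_url(pdf_id: str, extension: str) -> str:
    """Build a download URL by appending a format extension to the PDF ID."""
    # Accept both "mmd" and ".mmd" so callers cannot produce a double dot.
    return f"{BASE_URL}/{pdf_id}.{extension.lstrip('.')}"


for ext in ("mmd", "docx", "tex.zip"):
    print(result_url("PDF_ID", ext))
```

Requests to these URLs need the same `app_id` and `app_key` headers as the other calls.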

Check conversion status

If you requested additional conversion formats (such as convert_to_docx and convert_to_tex_zip in the submit examples), check their status separately:

print(pdf.pdf_conversion_status())

Response:

{
  "status": "completed",
  "conversion_status": {
    "docx": { "status": "completed" },
    "tex.zip": { "status": "completed" }
  }
}
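
Before downloading converted files, it can help to list which conversions are still outstanding. The sketch below assumes the conversion status is available as a dict shaped like the response above. `pending_conversions` is a hypothetical helper, and the "processing" value in the second payload is a made-up placeholder for any status other than "completed".

```python
def pending_conversions(payload: dict) -> list[str]:
    """Return the formats whose conversion has not reported "completed"."""
    conversions = payload.get("conversion_status", {})
    return sorted(
        fmt for fmt, info in conversions.items()
        if info.get("status") != "completed"
    )


# The example response shown above: nothing is pending.
done = {
    "status": "completed",
    "conversion_status": {
        "docx": {"status": "completed"},
        "tex.zip": {"status": "completed"},
    },
}

# Hypothetical payload with one conversion still in progress.
partial = {
    "status": "completed",
    "conversion_status": {
        "docx": {"status": "processing"},
        "tex.zip": {"status": "completed"},
    },
}

print(pending_conversions(done))
print(pending_conversions(partial))
```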

Stream pages

For lower time-to-first-data, enable streaming to receive page results via server-sent events (SSE) as each page completes:

# 1. Submit with streaming enabled
curl -X POST https://api.mathpix.com/v3/pdf \
  -H 'app_id: APP_ID' \
  -H 'app_key: APP_KEY' \
  -H 'Content-Type: application/json' \
  --data '{"url": "https://cdn.mathpix.com/examples/cs229-notes1.pdf", "streaming": true}'

# 2. Connect to the SSE stream
curl https://api.mathpix.com/v3/pdf/PDF_ID/stream \
  -H 'app_id: APP_ID' \
  -H 'app_key: APP_KEY'

Pages are streamed one JSON object at a time. They usually arrive in page order, but the order is not guaranteed.
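
Because arrival order is not guaranteed, a consumer that needs the document in order has to reorder pages itself. The sketch below shows one way to do that. It assumes each streamed object has already been read as one JSON string and carries an integer page index under a key named `page_idx` plus a `text` field. Both field names are assumptions, since this page does not spell out the payload schema, and `collect_pages` is a hypothetical helper.

```python
import json


def collect_pages(lines, index_key="page_idx"):
    """Parse streamed JSON objects and return them sorted by page index."""
    pages = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank keep-alive or separator lines
        obj = json.loads(line)
        pages[obj[index_key]] = obj  # a later duplicate replaces an earlier one
    return [pages[i] for i in sorted(pages)]


# Simulated stream in which page 2 arrives before page 1.
stream = [
    '{"page_idx": 2, "text": "second page"}',
    '',
    '{"page_idx": 1, "text": "first page"}',
]

ordered = collect_pages(stream)
print([page["text"] for page in ordered])
```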

Process specific pages

Use page_ranges to process only certain pages:

{
  "url": "https://cdn.mathpix.com/examples/cs229-notes1.pdf",
  "page_ranges": "2,4-6"
}

This selects pages [2, 4, 5, 6]. You can also use negative indices: "2 - -2" selects all pages from the second to the next-to-last.
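
To make the range semantics concrete, here is a local sketch that expands a `page_ranges` string into explicit page numbers. It is an illustration of the rules described above, not the parser the service uses. `expand_page_ranges` is a hypothetical helper, and it assumes that -1 refers to the last page, which is consistent with -2 being the next-to-last.

```python
import re

# One part is either a single page ("2") or a range ("4-6", "2 - -2").
_PART = re.compile(r"(-?\d+)\s*(?:-\s*(-?\d+))?")


def expand_page_ranges(spec: str, num_pages: int) -> list[int]:
    """Expand a page_ranges string into a list of 1-based page numbers."""

    def resolve(token: str) -> int:
        n = int(token)
        if n == 0:
            raise ValueError("page numbers start at 1")
        # Negative indices count back from the end: -1 is the last page.
        return n if n > 0 else num_pages + n + 1

    pages = []
    for part in spec.split(","):
        match = _PART.fullmatch(part.strip())
        if match is None:
            raise ValueError(f"invalid page range: {part!r}")
        start = resolve(match.group(1))
        end = resolve(match.group(2)) if match.group(2) else start
        pages.extend(range(start, end + 1))
    return pages


print(expand_page_ranges("2,4-6", num_pages=12))
print(expand_page_ranges("2 - -2", num_pages=5))
```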

Delete results

Permanently delete a PDF's output data when you no longer need it:

curl -X DELETE https://api.mathpix.com/v3/pdf/PDF_ID \
  -H 'app_id: APP_ID' \
  -H 'app_key: APP_KEY'

Warning: Download and store files locally before deleting if you need to keep them. Deletion is permanent.

Supported formats

Input: PDF, EPUB, DOCX, PPTX, AZW/AZW3/KFX, MOBI, DJVU, DOC, WPD, ODT

Output: MMD, MD, DOCX, LaTeX zip, HTML, PDF (with HTML or LaTeX rendering), PPTX, and ZIP variants with images
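
A quick local check before uploading can catch unsupported files early. The sketch below maps the input formats listed above to file extensions. That mapping assumes each format uses its conventional extension, and `is_supported_input` is a hypothetical helper rather than part of the Mathpix client.

```python
from pathlib import Path

# Extensions assumed from the input formats listed above.
SUPPORTED_INPUT_EXTENSIONS = {
    ".pdf", ".epub", ".docx", ".pptx", ".azw", ".azw3",
    ".kfx", ".mobi", ".djvu", ".doc", ".wpd", ".odt",
}


def is_supported_input(path: str) -> bool:
    """Check a file name's extension against the supported input formats."""
    return Path(path).suffix.lower() in SUPPORTED_INPUT_EXTENSIONS


print(is_supported_input("document.pdf"))
print(is_supported_input("photo.png"))
```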
