Process Documents

Submit documents for OCR, check processing status, stream results, and download output in multiple formats.

Endpoint                        Description
POST v3/pdf                     Submit a document for async OCR processing
GET v3/pdf/{pdf_id}/stream      Stream page results via SSE
GET v3/pdf/{pdf_id}             Check processing status
GET v3/converter/{pdf_id}       Check conversion format status
GET v3/pdf/{pdf_id}.{ext}       Download results in a specific format
DELETE v3/pdf/{pdf_id}          Permanently delete output data

See the PDF processing guide for step-by-step examples.

POST v3/pdf

POST api.mathpix.com/v3/pdf

Process PDFs, ebooks, and documents asynchronously. Returns a pdf_id for polling status and downloading results. Maximum file size: 1 GB.

Supported inputs (full list):

  • Documents: PDF, DOCX, PPTX, DOC, WPD, ODT
  • Ebooks: EPUB, AZW/AZW3/KFX, MOBI, DJVU

Supported outputs (full list):

  • Text: MMD, MD, HTML, LaTeX (.tex.zip)
  • Office: DOCX, PPTX
  • PDF: HTML-rendered or LaTeX-rendered
  • Archives: ZIP variants with embedded images

Example

curl -X POST https://api.mathpix.com/v3/pdf \
-H 'app_id: APP_ID' \
-H 'app_key: APP_KEY' \
-H 'Content-Type: application/json' \
--data '{"url": "https://cdn.mathpix.com/examples/cs229-notes1.pdf", "conversion_formats": {"docx": true, "tex.zip": true}}'
Example response
{
"pdf_id": "2024_01_15_abc123def456"
}

Request parameters

You can either send a file URL in a JSON body, or upload a file via multipart form-data (with parameters in options_json).
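As a rough illustration of the JSON-body style, the sketch below builds (but does not send) the same request as the curl example using only the Python standard library. The helper names build_url_request and build_options_json are hypothetical, not part of any Mathpix SDK; the second simply serializes parameters into the string that would go in the options_json field of a multipart upload.

```python
import json
import urllib.request

API_URL = "https://api.mathpix.com/v3/pdf"


def build_url_request(app_id, app_key, file_url, **options):
    """Build (without sending) a JSON-body POST for a document at file_url.

    Hypothetical helper: extra keyword arguments are passed through
    unchanged as request parameters.
    """
    body = {"url": file_url, **options}
    return urllib.request.Request(
        API_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "app_id": app_id,
            "app_key": app_key,
            "Content-Type": "application/json",
        },
        method="POST",
    )


def build_options_json(**options):
    """Serialize parameters for the options_json field of a multipart upload."""
    return json.dumps(options)


req = build_url_request(
    "APP_ID",
    "APP_KEY",
    "https://cdn.mathpix.com/examples/cs229-notes1.pdf",
    conversion_formats={"docx": True, "tex.zip": True},
)
# Passing req to urllib.request.urlopen would submit it; the JSON response
# is expected to carry the pdf_id used by the other endpoints.
```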

url string (optional)

HTTP URL where the file can be downloaded from

streaming bool (optional), default value is false

Whether streaming should be enabled for this request, see stream pdf pages.

metadata object (optional)

Key-value object. Supports improve_mathpix for extra privacy controls.

alphabets_allowed AlphabetsAllowed (optional)

Specify which alphabets you don't want in the output

rm_spaces bool (optional), default value is true

Determines whether extra white space is removed from equations in latex_styled and text formats.

rm_fonts bool (optional), default value is false

Determines whether font commands such as \mathbf and \mathrm are removed from equations in latex_styled and text formats.

idiomatic_eqn_arrays bool (optional), default value is false

Specifies whether to use aligned, gathered, or cases instead of an array environment for a list of equations.

include_equation_tags bool (optional)

Specifies whether to include equation number tags inside the equation LaTeX. Setting this to true also sets "idiomatic_eqn_arrays": true, because equation numbering works better in those environments than in the array environment.

Example

\tag{eq_number}, where eq_number is an equation number (e.g. 1.12)

include_smiles bool (optional), default value is true

Enables experimental chemistry diagram OCR. Diagrams are returned as RDKit-normalized SMILES (with isomericSmiles=False) and embedded in the text output using the MMD SMILES syntax <smiles>...</smiles>.

include_chemistry_as_image bool (optional), default value is false

For chemical diagrams, returns an image crop with the SMILES string in the alt text.

Example

![<smiles>CCC</smiles>](https://cdn.mathpix.com/cropped/image_id.jpg)

include_diagram_text bool (optional), default value is false

Enables text extraction from diagrams. The extracted text will be part of lines.json data, and not part of the lines.mmd.json or final mmd. The parent_id of these text lines will correspond to the id of one of the diagrams in the lines.json data.
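One way to use this is to collect the extracted text under its diagram. The sketch below assumes you have already flattened the lines.json data into a list of line dictionaries with the documented id, parent_id, and text fields. The function name and the sample records are made up for illustration.

```python
from collections import defaultdict


def text_by_parent(lines, parent_ids):
    """Group the text of child lines under each parent id of interest.

    lines is a flat list of line dicts; parent_ids is the set of diagram
    line ids you care about (how you pick those is up to the caller).
    """
    grouped = defaultdict(list)
    for line in lines:
        parent = line.get("parent_id")
        if parent in parent_ids:
            grouped[parent].append(line["text"])
    return dict(grouped)


# Made-up sample records, not real API output.
sample_lines = [
    {"id": "d1", "text": ""},
    {"id": "t1", "parent_id": "d1", "text": "x axis"},
    {"id": "t2", "parent_id": "d1", "text": "y axis"},
    {"id": "p1", "text": "Body paragraph"},
]
labels = text_by_parent(sample_lines, {"d1"})
```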

include_page_info bool (optional), default value is false

Controls whether page info elements are included in the final MMD output. Page info refers to elements such as headers, footers, and page numbers that are not part of the main text. Note that the default here is false, unlike v3/text, where it defaults to true.

numbers_default_to_math bool (optional), default value is false

Specifies whether numbers are always rendered as math.

Example

Answer: \( 17 \) instead of Answer: 17

math_inline_delimiters [string, string] (optional), default value is ["\\(", "\\)"]

Specifies begin inline math and end inline math delimiters for text outputs.

math_display_delimiters [string, string] (optional), default value is ["\\[", "\\]"]

Specifies begin display math and end display math delimiters for text outputs.

page_ranges string

Specifies a page range as a comma-separated string.

Examples

  • 2,4-6 selects pages [2,4,5,6]
  • 2 - -2 selects all pages starting with the second page and ending with the next-to-last page (specified by -2)
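To make the two examples concrete, here is a client-side sketch that expands a page_ranges string into explicit page numbers. It is only an approximation of the server's behavior: the exact parsing rules are an assumption, as is reading -1 as the last page (inferred from -2 being the next-to-last). The function name expand_page_ranges is hypothetical.

```python
import re


def expand_page_ranges(spec, num_pages):
    """Expand a page_ranges string into a list of 1-based page numbers.

    Negative numbers are assumed to count from the end of the document,
    so -1 is the last page and -2 the next-to-last.
    """

    def resolve(token):
        n = int(token)
        return n if n > 0 else num_pages + 1 + n

    pages = []
    for part in spec.split(","):
        part = part.replace(" ", "")
        match = re.fullmatch(r"(-?\d+)(?:-(-?\d+))?", part)
        if not match:
            raise ValueError(f"bad page range: {part!r}")
        start = resolve(match.group(1))
        end = resolve(match.group(2)) if match.group(2) else start
        pages.extend(range(start, end + 1))
    return pages


first = expand_page_ranges("2,4-6", num_pages=10)
second = expand_page_ranges("2 - -2", num_pages=10)
```

With a 10-page document, the first call yields pages 2, 4, 5, 6 and the second yields pages 2 through 9.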
enable_spell_check bool (deprecated)

Deprecated, has no effect on the output.

auto_number_sections bool, default value is false

Specifies whether sections and subsections in the output are automatically numbered (see Section numbering).

remove_section_numbering bool, default value is false

Specifies whether to remove existing numbering for sections and subsections (see Section numbering).

preserve_section_numbering bool, default value is true

Specifies whether to keep existing section numbering as is (see Section numbering).

enable_tables_fallback bool, default value is false

Enables advanced table processing algorithm that supports large and complex tables.

fullwidth_punctuation bool (optional), default value is null

Controls if punctuation will be fullwidth Unicode (default for east Asian languages like Chinese), or halfwidth Unicode (default for Latin scripts, Cyrillic scripts etc.). When null, fullwidth vs halfwidth will be decided based on image content. Punctuation inside math will always stay halfwidth.

conversion_formats ConversionFormats

Specifies formats that the v3/pdf output (Mathpix Markdown) should automatically be converted into on completion.

conversion_options ConversionOptions (optional)

Options for specific output formats (e.g. font, margins, orientation for DOCX).

Section numbering

warning

Only one of auto_number_sections, remove_section_numbering, or preserve_section_numbering can be true at a time. The default behavior is to preserve section numbering (preserve_section_numbering set to true). Setting multiple flags returns error opts_section_numbering.
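A client can catch this mistake before sending the request. The sketch below checks only the flags that are explicitly set to true in the options you pass; it assumes that an omitted preserve_section_numbering does not conflict with an explicit auto_number_sections or remove_section_numbering, which the text above does not spell out. The helper name is hypothetical.

```python
SECTION_FLAGS = (
    "auto_number_sections",
    "remove_section_numbering",
    "preserve_section_numbering",
)


def check_section_numbering(options):
    """Return the enabled section-numbering flags, or raise if more than one is set."""
    enabled = [flag for flag in SECTION_FLAGS if options.get(flag)]
    if len(enabled) > 1:
        # The API would reject this request with opts_section_numbering.
        raise ValueError("conflicting section numbering flags: " + ", ".join(enabled))
    return enabled


ok = check_section_numbering({"auto_number_sections": True})

try:
    check_section_numbering(
        {"auto_number_sections": True, "remove_section_numbering": True}
    )
    conflict_detected = False
except ValueError:
    conflict_detected = True
```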

Response body

pdf_id string

Tracking ID to get status and result when completed

error string (optional)

US locale error message

error_info ErrorInfo (optional)

Error info object

GET v3/pdf/{pdf_id}/stream

Stream page results via server-sent events (SSE) for lower time to first data. Requires streaming: true in the initial POST request.

GET api.mathpix.com/v3/pdf/{pdf_id}/stream

Example

curl -N https://api.mathpix.com/v3/pdf/PDF_ID/stream \
-H 'app_id: APP_ID' \
-H 'app_key: APP_KEY' \
-H 'Accept: text/event-stream'
text string

Mathpix Markdown output

page_idx number

Page index within the selected page range, from 1 to pdf_selected_len

pdf_selected_len number

Total number of pages in the selected page range

confidence number

overall confidence score for the page (0–1)

confidence_rate number

rate of high-confidence characters on the page (0–1)

version string

model version used for processing (e.g. SuperNet-109p4)

Pages are streamed one JSON object at a time. Pages are not guaranteed to be in order, although they generally will be.
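Because pages can arrive out of order, a consumer usually keys them by page_idx and sorts at the end. The sketch below does that for an iterable of text lines. The wire framing is an assumption: it accepts either bare JSON lines or SSE-style lines prefixed with data:, and skips blank lines. The sample events are made up.

```python
import json


def collect_pages(lines):
    """Collect streamed page objects and return their text in page order."""
    pages = {}
    for raw in lines:
        line = raw.strip()
        if line.startswith("data:"):
            # Assumed SSE framing; strip the field name before parsing.
            line = line[len("data:"):].strip()
        if not line:
            continue
        page = json.loads(line)
        pages[page["page_idx"]] = page["text"]
    return [pages[idx] for idx in sorted(pages)]


# Made-up events arriving out of order.
sample_stream = [
    'data: {"page_idx": 2, "pdf_selected_len": 2, "text": "second page"}',
    "",
    'data: {"page_idx": 1, "pdf_selected_len": 2, "text": "first page"}',
]
ordered = collect_pages(sample_stream)
```

Comparing the number of collected pages against pdf_selected_len is one way to tell when the stream is complete.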

GET v3/pdf/{pdf_id}

Check the processing status of a PDF.

GET api.mathpix.com/v3/pdf/{pdf_id}

Example

curl https://api.mathpix.com/v3/pdf/PDF_ID \
-H 'app_id: APP_ID' \
-H 'app_key: APP_KEY'
pdf_id string

The PDF tracking ID

status string

Value         Meaning
received      Request accepted
loaded        PDF downloaded onto our servers
split         Pages split and sent for processing
completed     Processing finished successfully
error         A problem occurred during processing

num_pages integer (optional)

Total number of pages in PDF document

num_pages_completed integer (optional)

Current number of pages in PDF document that have been OCR-ed

percent_done number (optional)

Percentage of pages in PDF that have been OCR-ed

app_id string (optional)

The app ID that submitted the request

group_id string (optional)

The group ID associated with the request

input_file string (optional)

Original filename of the uploaded document

conversion_status ConversionStatus (object) (optional)

Status of each requested conversion format.
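A typical client polls this endpoint until status reaches completed or error. The loop below is a minimal sketch: fetch_status is a caller-supplied function (hypothetical) that performs the GET request and returns the parsed JSON. The interval and timeout defaults are arbitrary choices, not recommendations from the API.

```python
import time


def wait_for_pdf(fetch_status, pdf_id, interval=2.0, timeout=600.0, sleep=time.sleep):
    """Poll fetch_status(pdf_id) until processing completes or fails."""
    waited = 0.0
    while True:
        info = fetch_status(pdf_id)
        status = info.get("status")
        if status == "completed":
            return info
        if status == "error":
            raise RuntimeError(f"processing failed for {pdf_id}")
        if waited >= timeout:
            raise TimeoutError(f"gave up on {pdf_id} after {waited:.0f}s")
        sleep(interval)
        waited += interval


# Usage with a fake fetcher that walks through the documented statuses.
_fake_states = iter(
    [
        {"status": "received"},
        {"status": "split", "num_pages": 3, "num_pages_completed": 1},
        {"status": "completed", "num_pages": 3, "num_pages_completed": 3},
    ]
)
result = wait_for_pdf(lambda pdf_id: next(_fake_states), "PDF_ID", sleep=lambda s: None)
```

Injecting fetch_status and sleep keeps the loop testable without network access.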

GET v3/converter/{pdf_id}

Check the status of requested conversion formats.

GET api.mathpix.com/v3/converter/{pdf_id}

Example

curl https://api.mathpix.com/v3/converter/PDF_ID \
-H 'app_id: APP_ID' \
-H 'app_key: APP_KEY'
status string

Always completed once the PDF has finished processing. Individual format progress is tracked in conversion_status.

conversion_status ConversionStatus (object) (optional)

Status of each requested conversion format.

GET v3/pdf/{pdf_id}.{ext}

Download results by appending the format extension to the pdf_id. Results are available once status=completed. Conversion formats (e.g., docx) require the format's conversion status to be completed.

GET api.mathpix.com/v3/pdf/{pdf_id}.{ext}

Example

curl https://api.mathpix.com/v3/pdf/PDF_ID.docx \
-H 'app_id: APP_ID' \
-H 'app_key: APP_KEY' \
-o output.docx

Accepted extensions: .mmd, .md, .docx, .tex.zip, .html, .pdf, .latex.pdf, .pptx, .mmd.zip, .md.zip, .html.zip, .lines.json, .lines.mmd.json

See Conversion Formats for availability and descriptions. The lines.json and lines.mmd.json extensions are v3/pdf-only — see PDF lines data and PDF MMD lines data.
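A small helper can build the download URL and reject unknown extensions before any request is made. The sketch below uses the accepted extensions listed above; the function name download_url is hypothetical.

```python
ACCEPTED_EXTENSIONS = {
    "mmd", "md", "docx", "tex.zip", "html", "pdf", "latex.pdf", "pptx",
    "mmd.zip", "md.zip", "html.zip", "lines.json", "lines.mmd.json",
}


def download_url(pdf_id, ext):
    """Return the download URL for pdf_id in the given format."""
    ext = ext.lstrip(".")  # accept both "docx" and ".docx"
    if ext not in ACCEPTED_EXTENSIONS:
        raise ValueError(f"unsupported extension: {ext!r}")
    return f"https://api.mathpix.com/v3/pdf/{pdf_id}.{ext}"


docx_url = download_url("PDF_ID", ".docx")
lines_url = download_url("PDF_ID", "lines.json")
```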

PDF lines data

Detailed line-by-line data for PDFs, useful for building custom experiences on top of original PDFs.

Response data object

List of PdfPageData objects

PdfPageData object

image-id string

PDF ID, plus hyphen, plus page number, starting at page 1

page integer

Page number

List of PdfLineData objects

page_height integer

Page height (in pixel coordinates)

page_width integer

Page width (in pixel coordinates)

PdfLineData object

id string

Unique line identifier

parent_id string (optional)

Unique line identifier of the parent.

children_ids string[] (optional)

List of children unique identifiers.

type string

See line types and subtypes for details.

subtype string (optional)

See line types and subtypes for details.

line integer

Line number

text string

Searchable text. This is an empty string for page elements that do not necessarily have associated text (for example, individual equations inside a block of math equations).

text_display string

Mathpix Markdown content with contextual elements such as article, section and inline image URLs. Can be empty for page elements that will not render (for example page number, auxiliary text in the page header, etc.).

conversion_output boolean

When true, text_display from the line is included in the final MMD output, otherwise excluded.

is_printed boolean

True if line contains printed text, false otherwise.

is_handwritten boolean

True if line contains handwritten text, false otherwise.

region Region

Bounding box of the line in pixel coordinates

cnt [[x,y]]

Specifies the image area as a list of (x,y) pixel coordinate pairs. This captures handwritten content much better than a region object.

confidence number in [0,1]

Estimated probability that the line is 100% correct (product of per-token OCR confidences).

confidence_rate number in [0,1]

Estimated confidence of output quality (geometric mean of per token OCR confidence).
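The difference between the two scores is easiest to see numerically. The sketch below applies the two formulas named above to a list of per-token confidences. The API does not return these per-token values, so the inputs here are hypothetical and the functions only illustrate the arithmetic.

```python
import math


def line_confidence(token_confidences):
    """Product of per-token confidences: chance that every token is right."""
    return math.prod(token_confidences)


def line_confidence_rate(token_confidences):
    """Geometric mean of per-token confidences: a length-independent score."""
    log_sum = sum(math.log(c) for c in token_confidences)
    return math.exp(log_sum / len(token_confidences))


# Hypothetical per-token scores for a short line and a longer one.
short_line = [0.9, 0.9]
long_line = [0.9, 0.9, 0.9, 0.9]

short_conf = line_confidence(short_line)
long_conf = line_confidence(long_line)
long_rate = line_confidence_rate(long_line)
```

With equally reliable tokens, confidence falls as the line gets longer (0.81 for two tokens, about 0.656 for four), while confidence_rate stays at about 0.9. This makes confidence_rate the better choice for comparing lines of different lengths.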

PDF MMD lines data (deprecated)

warning

Deprecated. Use lines.json instead, which contains all this information and more.

Response data object (MMD Lines)

List of PdfMMDPageData objects

PdfMMDPageData object

image-id string

PDF ID, plus hyphen, plus page number, starting at page 1

page integer

Page number

List of PdfMMDLineData objects

page_height integer

Page height (in pixel coordinates)

page_width integer

Page width (in pixel coordinates)

PdfMMDLineData object

line integer

Line number

text string

Mathpix Markdown content with contextual elements such as article, section and inline image URLs

is_printed boolean

True if line contains printed text, false otherwise.

is_handwritten boolean

True if line contains handwritten text, false otherwise.

region Region

Bounding box of the line in pixel coordinates

cnt [[x,y]]

Specifies the image area as a list of (x,y) pixel coordinate pairs. This captures handwritten content much better than a region object.

confidence number in [0,1]

Estimated probability that the line is 100% correct (product of per-token OCR confidences).

confidence_rate number in [0,1]

Estimated confidence of output quality (geometric mean of per token OCR confidence).

DELETE v3/pdf/{pdf_id}

Permanently delete a PDF's output data.

DELETE api.mathpix.com/v3/pdf/{pdf_id}

Example

curl -X DELETE https://api.mathpix.com/v3/pdf/PDF_ID \
-H 'app_id: APP_ID' \
-H 'app_key: APP_KEY'

When a PDF is deleted:

  • All output files are permanently removed from our servers (MMD, images, JSON Lines, and all requested formats)
  • The original input file is also deleted
  • This deletion is permanent and cannot be undone
warning

Download and store files locally before deleting if you need to keep them.

note

PDF page images and cropped images (figures, diagrams) served via CDN may remain accessible for up to 24 hours after deletion while cached copies expire.

Minimal metadata is retained for auditing and billing: status, input_file name, num_pages, timestamps, and processing version. No output content is stored.

warning

If privacy is a concern, rename the file before upload to avoid storing identifiable filenames.

Response body

Returns the PDF status object at the time of deletion.

pdf_id string

The PDF tracking ID

app_id string (optional)

The app ID that submitted the request

group_id string (optional)

The group ID associated with the request

status string

Processing status at time of deletion (e.g. completed)

input_file string (optional)

Original filename of the uploaded document

num_pages integer (optional)

Total number of pages in PDF document

num_pages_completed integer (optional)

Number of pages that were OCR-ed

percent_done number (optional)

Percentage of pages that were OCR-ed

conversion_status ConversionStatus (object) (optional)

Status of each requested conversion format

After deletion, subsequent GET requests to v3/pdf/{pdf_id} return the same status object with an additional deleted_at field:

deleted_at string (optional)

ISO 8601 timestamp of when the PDF was deleted. Only present on GET requests after deletion.
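A client can use this field to tell deleted documents from live ones. The sketch below parses it with the standard library. The sample timestamp is made up, and its exact shape (a trailing Z, no fractional seconds) is an assumption about what the API returns.

```python
from datetime import datetime, timezone


def parse_deleted_at(status):
    """Return the deletion time as an aware datetime, or None if not deleted."""
    value = status.get("deleted_at")
    if value is None:
        return None
    # Older Python versions do not accept a trailing "Z" in fromisoformat.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


# Made-up status objects, not real API output.
deleted = parse_deleted_at({"pdf_id": "PDF_ID", "deleted_at": "2024-01-15T12:30:00Z"})
live = parse_deleted_at({"pdf_id": "PDF_ID", "status": "completed"})
```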