POST v3/pdf
POST api.mathpix.com/v3/pdf
Process PDFs, ebooks, and documents asynchronously. Returns a pdf_id for polling status and downloading results. Maximum file size: 1 GB.
See the PDF processing guide for step-by-step examples.
Supported inputs: PDF, EPUB, DOCX, PPTX, AZW/AZW3/KFX, MOBI, DJVU, DOC, WPD, ODT
Supported outputs: MMD, MD, DOCX, LaTeX zip, HTML, PDF (HTML or LaTeX rendering), PPTX, and ZIP variants with images
Request parameters
You can either send a file URL in a JSON body, or upload a file via multipart form-data (with parameters in options_json).
url HTTP URL where the file can be downloaded from
streaming Whether streaming should be enabled for this request, see stream pdf pages.
metadata Key value object
alphabets_allowed Use this to specify which alphabets you don't want in the output. See the AlphabetsAllowed section.
rm_spaces Determines whether extra white space is removed from equations in latex_styled and text formats.
rm_fonts Determines whether font commands such as \mathbf and \mathrm are removed from equations in latex_styled and text formats.
idiomatic_eqn_arrays Specifies whether to use aligned, gathered, or cases instead of an array environment for a list of equations.
include_equation_tags Specifies whether to include equation number tags inside the equation LaTeX. Setting this to true also sets "idiomatic_eqn_arrays": true, because equation numbering works better in those environments than in the array environment.
Example
\tag{eq_number}, where eq_number is an equation number (e.g. 1.12)
include_smiles Enable experimental chemistry diagram OCR, via RDKIT normalized SMILES with isomericSmiles=False, included in text output format, via MMD SMILES syntax <smiles>...</smiles>.
include_chemistry_as_image Returns an image crop containing the SMILES in the alt-text for chemical diagrams.
include_diagram_text Enables text extraction from diagrams. The extracted text will be part of lines.json data, and not part of the lines.mmd.json or final mmd. The parent_id of these text lines will correspond to the id of one of the diagrams in the lines.json data.
include_page_info Controls whether page info elements are included in the final MMD output. Page info refers to various elements that are not part of the main text (unlike v3/text where it defaults to true).
numbers_default_to_math
math_inline_delimiters Specifies begin inline math and end inline math delimiters for text outputs.
math_display_delimiters Specifies begin display math and end display math delimiters for text outputs.
page_ranges Specifies a page range as a comma-separated string.
Examples
2,4-6 selects pages [2,4,5,6]
2 - -2 selects all pages starting with the second page and ending with the next-to-last page (specified by -2)
enable_spell_check Deprecated, has no effect on the output.
auto_number_sections Specifies whether sections and subsections in the output are automatically numbered (see Section numbering)
remove_section_numbering Specifies whether to remove existing numbering for sections and subsections (see Section numbering)
preserve_section_numbering Specifies whether to keep existing section numbering as is (see Section numbering)
enable_tables_fallback Enables advanced table processing algorithm that supports very large and complex tables.
fullwidth_punctuation Controls if punctuation will be fullwidth Unicode (default for east Asian languages like Chinese), or halfwidth Unicode (default for Latin scripts, Cyrillic scripts etc.). When null, fullwidth vs halfwidth will be decided based on image content. Punctuation inside math will always stay halfwidth.
conversion_formats Specifies formats that the v3/pdf output (Mathpix Markdown) should automatically be converted into on completion. See Conversion Formats.
conversion_options Options for specific output formats (e.g. font, margins, orientation for DOCX). See Conversion Options.
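The page_ranges examples above can be checked locally with a small sketch. The helper name expand_page_ranges is hypothetical, and the rule that -1 means the last page is an assumption inferred from -2 being described as the next-to-last page; the service may resolve edge cases differently.

```python
def expand_page_ranges(spec: str, num_pages: int) -> list[int]:
    """Expand a page_ranges string such as "2,4-6" into 1-based page numbers."""

    def resolve(token: str) -> int:
        value = int(token)
        # Assumption: negative numbers count from the end (-1 = last page).
        return num_pages + 1 + value if value < 0 else value

    pages: list[int] = []
    for part in spec.replace(" ", "").split(","):
        if not part:
            continue
        # Find the range separator, skipping a possible leading minus sign.
        sep = part.find("-", 1)
        if sep == -1:
            start = end = resolve(part)
        else:
            start, end = resolve(part[:sep]), resolve(part[sep + 1:])
        pages.extend(range(start, end + 1))
    return pages
```

For a 5-page document, expand_page_ranges("2 - -2", 5) yields pages 2 through 4.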
Section numbering
Only one of auto_number_sections, remove_section_numbering, or preserve_section_numbering can be true at a time. The default behavior is to preserve section numbering (preserve_section_numbering set to true). Setting multiple flags returns error opts_section_numbering.
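As a rough illustration of the two request styles and the section numbering rule, the sketch below builds either a JSON body (with url) or the options_json form field for a multipart upload, and rejects conflicting flags before anything is sent. The helper names are hypothetical, and the client-side check only mirrors the documented opts_section_numbering error; it is not the server's own validation.

```python
import json

SECTION_FLAGS = (
    "auto_number_sections",
    "remove_section_numbering",
    "preserve_section_numbering",
)


def build_options(**options) -> dict:
    """Return the options unchanged after checking the section numbering flags."""
    enabled = [flag for flag in SECTION_FLAGS if options.get(flag)]
    if len(enabled) > 1:
        # The API is documented to answer with error opts_section_numbering here.
        raise ValueError(f"only one section numbering flag may be true: {enabled}")
    return dict(options)


def json_body(url: str, **options) -> str:
    """Body for a JSON request that points the service at a file URL."""
    return json.dumps({"url": url, **build_options(**options)})


def multipart_fields(**options) -> dict:
    """Form fields for a file upload; parameters travel inside options_json."""
    return {"options_json": json.dumps(build_options(**options))}
```

The file part of a multipart upload is left out on purpose, since its field name is not stated in this section.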
Response body
pdf_id Tracking ID to get status and result when completed
error US locale error message
error_info Error info object
GET v3/pdf/{pdf_id}/stream
Stream page results via server-sent events (SSE) for lower time to first data. Requires streaming: true in the initial POST request.
GET api.mathpix.com/v3/pdf/{pdf_id}/stream
text Mathpix Markdown output
page_idx page index from selected page range, starting at 1 and going all the way to pdf_selected_len
pdf_selected_len total number of pages inside selected page range
confidence overall confidence score for the page (0–1)
confidence_rate rate of high-confidence characters on the page (0–1)
version model version used for processing (e.g. SuperNet-109p4)
Pages are streamed one JSON object at a time. Pages are not guaranteed to be in order, although they generally will be.
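Because pages may arrive out of order, a client usually buffers them and sorts by page_idx. The sketch below assumes each page arrives as one line of JSON, optionally prefixed with the standard SSE "data:" field name; the exact wire framing is an assumption, and the function names are hypothetical.

```python
import json
from typing import Iterable, Iterator


def iter_pages(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one page object per streamed line, skipping non-JSON lines."""
    for raw in lines:
        line = raw.strip()
        if line.startswith("data:"):
            line = line[len("data:"):].strip()
        if not line.startswith("{"):
            continue  # blank keep-alive lines, comments, other SSE fields
        yield json.loads(line)


def collect_text(lines: Iterable[str]) -> str:
    """Join page texts in page order, regardless of arrival order."""
    pages = sorted(iter_pages(lines), key=lambda page: page["page_idx"])
    return "\n\n".join(page["text"] for page in pages)
```

Any iterable of decoded lines works as input, for example the line iterator of an HTTP response opened with streaming enabled.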
GET v3/pdf/{pdf_id}
Check the processing status of a PDF.
GET api.mathpix.com/v3/pdf/{pdf_id}
pdf_id The PDF tracking ID
status Processing status: received upon successful request, loaded once the PDF has been downloaded onto our servers, split when the PDF pages are split and sent for processing, completed when the PDF is done processing, or error if a problem occurs during processing
num_pages Total number of pages in PDF document
num_pages_completed Current number of pages in PDF document that have been OCR-ed
percent_done Percentage of pages in PDF that have been OCR-ed
app_id The app ID that submitted the request
group_id The group ID associated with the request
input_file Original filename of the uploaded document
conversion_status Status of each requested conversion format.
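A polling loop over the status fields above might look like the following sketch. The fetch_status callable is a hypothetical stand-in for whatever performs the GET request and returns the decoded JSON body; the interval and poll limit are arbitrary illustration values.

```python
import time
from typing import Callable

TERMINAL_STATUSES = {"completed", "error"}


def wait_for_pdf(
    fetch_status: Callable[[], dict],
    interval: float = 2.0,
    max_polls: int = 300,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll until status is completed or error, then return the last body."""
    for _ in range(max_polls):
        body = fetch_status()
        if body.get("status") in TERMINAL_STATUSES:
            return body
        sleep(interval)
    raise TimeoutError(f"status not final after {max_polls} polls")
```

Passing sleep as a parameter keeps the loop testable without real delays.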
GET v3/converter/{pdf_id}
Check the status of requested conversion formats.
GET api.mathpix.com/v3/converter/{pdf_id}
status completed for an existing mmd document
conversion_status Status of each requested conversion format.
GET v3/pdf/{pdf_id}.{ext}
Download results by appending the format extension to the pdf_id. Results are available once status=completed. Conversion formats (e.g., docx) require the format's conversion status to be completed.
GET api.mathpix.com/v3/pdf/{pdf_id}.{ext}
| Extension | Description |
|---|---|
| mmd | Returns Mathpix Markdown text file |
| md | Returns plain Markdown text file |
| docx | Returns a docx file |
| latex.pdf | Returns a PDF file with LaTeX rendering |
| pdf | Returns a PDF file with HTML rendering |
| html | Returns a HTML file with the rendered Mathpix Markdown content |
| lines.json | Returns line by line data |
| lines.mmd.json | Returns line by line mmd data. Deprecated; use lines.json instead, which contains all this information and more. |
| pptx | Returns a pptx file |
| tex.zip | Returns a LaTeX zip file containing the .tex file and any images that appear in the document |
| mmd.zip | Returns a MMD zip file containing a Mathpix Markdown text file and any images that appear in the document |
| md.zip | Returns a MD zip file containing a Markdown text file and any images that appear in the document |
| html.zip | Returns a HTML zip file containing a HTML file and any images that appear in the document |
For ZIP formats, images will be referenced as follows:
- tex.zip: \includegraphics[max width=\textwidth]{2024_12_15_dfc981061e9740db9fd6g-01}
- mmd.zip and md.zip:
- html.zip: <img src="./images/2024_12_15_dfc981061e9740db9fd6g-01.jpg" alt="">
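Download URLs follow the {pdf_id}.{ext} pattern, so a small helper can build them and reject unknown extensions early. This is a sketch: the https scheme and the helper name result_url are assumptions, and the "pdf" entry assumes that the HTML-rendered PDF uses that extension.

```python
BASE_URL = "https://api.mathpix.com/v3/pdf"  # scheme is an assumption

EXTENSIONS = {
    "mmd", "md", "docx", "latex.pdf", "html", "lines.json",
    "lines.mmd.json", "pptx", "tex.zip", "mmd.zip", "md.zip", "html.zip",
    "pdf",  # assumed extension for the HTML-rendered PDF
}


def result_url(pdf_id: str, ext: str) -> str:
    """Build the download URL for one output format of a processed PDF."""
    ext = ext.lstrip(".")
    if ext not in EXTENSIONS:
        raise ValueError(f"unsupported extension: {ext!r}")
    return f"{BASE_URL}/{pdf_id}.{ext}"
```

The request itself should only be sent once the status (and, for conversion formats, the format's conversion status) is completed.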
PDF lines data
Detailed line-by-line data for PDFs, useful for building custom experiences on top of original PDFs.
Response data object
pages List of PdfPageData objects
PdfPageData object
image-id PDF ID, plus hyphen, plus page number, starting at page 1
page Page number
lines List of PdfLineData objects
page_height Page height (in pixel coordinates)
page_width Page width (in pixel coordinates)
PdfLineData object
id Unique line identifier
parent_id Unique line identifier of the parent.
children_ids List of children unique identifiers.
type See line types and subtypes for details.
subtype See line types and subtypes for details.
line Line number
text Searchable text, empty string for page elements that do not necessarily have associated text (for example individual equations inside block of math equations).
text_display Mathpix Markdown content with additional contextual elements such as article, section and inline image URLs. Can be empty for page elements which are not going to be rendered (for example page number, auxiliary text in the page header, etc.).
conversion_output When true, text_display from the line is included into final MMD output, otherwise it is not included.
is_printed True if line contains printed text, false otherwise.
is_handwritten True if line contains handwritten text, false otherwise.
region Specify the image area with the pixel coordinates top_left_x, top_left_y, width, and height
cnt Specifies the image area as list of (x,y) pixel coordinate pairs. This captures handwritten content much better than a region object
confidence Estimated probability that the output is 100% correct (product of per token OCR confidence).
confidence_rate Estimated confidence of output quality (geometric mean of per token OCR confidence).
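To show how these fields fit together, here is a minimal sketch that walks a decoded lines.json document. It keeps only lines whose conversion_output is true and derives a region-style box from a cnt contour. The function names are hypothetical, and the sample shapes follow only the fields listed above.

```python
def rendered_text(lines_json: dict) -> str:
    """Concatenate text_display for lines that end up in the final MMD."""
    parts = []
    for page in lines_json["pages"]:
        for line in page["lines"]:
            if line.get("conversion_output"):
                parts.append(line.get("text_display", ""))
    return "\n".join(parts)


def bounding_box(cnt: list) -> dict:
    """Axis-aligned box around a cnt contour, using the region field names."""
    xs = [x for x, _ in cnt]
    ys = [y for _, y in cnt]
    return {
        "top_left_x": min(xs),
        "top_left_y": min(ys),
        "width": max(xs) - min(xs),
        "height": max(ys) - min(ys),
    }
```

Dividing the box values by page_width and page_height gives coordinates that are independent of the pixel resolution.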
PDF MMD lines data (deprecated)
Deprecated. Use lines.json instead, which contains all this information and more.
Response data object (MMD Lines)
pages List of PdfMMDPageData objects
PdfMMDPageData object
image-id PDF ID, plus hyphen, plus page number, starting at page 1
page Page number
lines List of PdfMMDLineData objects
page_height Page height (in pixel coordinates)
page_width Page width (in pixel coordinates)
PdfMMDLineData object
line Line number
text Mathpix Markdown content with additional contextual elements such as article, section and inline image URLs
is_printed True if line contains printed text, false otherwise.
is_handwritten True if line contains handwritten text, false otherwise.
region Specify the image area with the pixel coordinates top_left_x, top_left_y, width, and height
cnt Specifies the image area as list of (x,y) pixel coordinate pairs. This captures handwritten content much better than a region object
confidence Estimated probability that the output is 100% correct (product of per token OCR confidence).
confidence_rate Estimated confidence of output quality (geometric mean of per token OCR confidence).
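The difference between the two confidence fields is easiest to see numerically. The sketch below applies the stated definitions (product and geometric mean) to a made-up list of per-token confidences; the API does not expose per-token values in this section, so the input is purely illustrative.

```python
import math


def confidence(token_confidences: list[float]) -> float:
    """Product of per-token confidences: chance that every token is right."""
    if not token_confidences:
        raise ValueError("need at least one token confidence")
    return math.prod(token_confidences)


def confidence_rate(token_confidences: list[float]) -> float:
    """Geometric mean of per-token confidences (length-independent)."""
    if not token_confidences:
        raise ValueError("need at least one token confidence")
    log_sum = sum(math.log(c) for c in token_confidences)
    return math.exp(log_sum / len(token_confidences))
```

With 50 tokens at 0.99 each, confidence drops to roughly 0.6 while confidence_rate stays at 0.99, which is why long lines can show a low confidence despite good per-token quality.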
DELETE v3/pdf/{pdf_id}
Permanently delete a PDF's output data.
DELETE api.mathpix.com/v3/pdf/{pdf_id}
When a PDF is deleted:
- All output files are permanently removed from our servers (MMD, images, JSON Lines, and all requested formats)
- The original input file is also deleted
- This deletion is permanent and cannot be undone
Download and store files locally before deleting if you need to keep them.
Minimal metadata is retained for auditing and billing: status, input_file name, num_pages, timestamps, and processing version. No output content is stored.
If privacy is a concern, rename the file before upload to avoid storing identifiable filenames.
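A delete call can be prepared with the standard library as sketched below. The https scheme and the app_id / app_key header names are assumptions not stated in this section, and the helper name is hypothetical; the request is only constructed here, not sent.

```python
import urllib.request


def delete_request(pdf_id: str, app_id: str, app_key: str) -> urllib.request.Request:
    """Prepare (but do not send) a DELETE request for a PDF's output data."""
    return urllib.request.Request(
        f"https://api.mathpix.com/v3/pdf/{pdf_id}",
        method="DELETE",
        # Assumed authentication headers; check your account documentation.
        headers={"app_id": app_id, "app_key": app_key},
    )
```

Passing the prepared request to urllib.request.urlopen would perform the deletion, so download any outputs you need first.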