Data Sources API

A data source is a registered pointer from your Mathpix account to a bucket or container you own, with an attached access grant. Once registered, you can pass s3://, gs://, or Azure Blob URLs as source_uri (read) or destination_uri (write) on POST /files/v1/uri and POST /files/v1/jobs.

The grant model is keyless by default — you grant a Mathpix identity scoped access via IAM role, AD app, or service-account impersonation. No secrets are uploaded to Mathpix; we obtain short-lived per-customer credentials on demand. (AWS supports a legacy access_key fallback; Azure and GCS do not.)

| Endpoint | Description |
| --- | --- |
| GET /files/v1/onboarding/identities | Get the Mathpix identities you grant access to (call first) |
| POST /files/v1/data-sources | Register a bucket / container |
| POST /files/v1/data-sources/{id}/test | Verify read (and write) access |
| GET /files/v1/data-sources | List the group's registered data sources |
| DELETE /files/v1/data-sources/{id} | Remove a data source |

Setup at a glance

  1. GET /files/v1/onboarding/identities — fetch the Mathpix identities (AWS account, Azure app, GCS service account) you'll grant access to, plus the per-group ExternalId (used in the AWS trust policy and as the GCS bucket-control verification id).
  2. Grant Mathpix access on the cloud side (AWS, Azure, GCS).
  3. POST /files/v1/data-sources — register the bucket, providing the non-secret metadata (role ARN, tenant ID, etc.).
  4. POST /files/v1/data-sources/{id}/test — verify the grant works.
  5. Reference the bucket via source_uri / destination_uri on submissions.
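
The API-side steps above (1, 3, and 4) map onto three HTTP calls. The following is a minimal sketch that only builds those requests without sending them, using Python's standard library. The helper names `build_request` and `setup_requests` are illustrative and not part of the Mathpix API; the `app_key` header mirrors the curl examples in this document.

```python
import json
import urllib.request

BASE_URL = "https://api.mathpix.com/files/v1"


def build_request(method, path, app_key, body=None):
    """Build (but do not send) an authenticated Files API request."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    req = urllib.request.Request(BASE_URL + path, data=data, method=method)
    req.add_header("app_key", app_key)
    if data is not None:
        req.add_header("Content-Type", "application/json")
    return req


def setup_requests(app_key, registration_body, data_source_id):
    """Return the API calls for steps 1, 3 and 4, in order.

    Step 2 (the cloud-side grant) happens in your cloud account, not here.
    In practice data_source_id comes from the step 3 response; it is a
    parameter here only so the sketch stays offline.
    """
    return [
        build_request("GET", "/onboarding/identities", app_key),
        build_request("POST", "/data-sources", app_key, registration_body),
        build_request("POST", f"/data-sources/{data_source_id}/test", app_key),
    ]
```

Passing any of the returned objects to `urllib.request.urlopen` would actually send it.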

GET /files/v1/onboarding/identities

GET api.mathpix.com/files/v1/onboarding/identities

Returns the Mathpix identities customers grant access to, plus the per-group external_id used in the AWS trust policy. Call this before setting up grants — the external_id is generated on first call and is immutable thereafter.

Example

curl -H 'app_key: APP_KEY' \
https://api.mathpix.com/files/v1/onboarding/identities
// Response 200
{
"aws": {
"trust_account_id": "426887012336",
"external_id": "<your-group-uuid>"
},
"azure": {
"app_id": "580fe4b7-6dbf-4313-adf4-10ea1fda13c7",
"tenant_id": "748276dd-b27a-439b-81a1-58334f1af34c"
},
"gcp": {
"service_account_email": "ingest@mathpix-files.iam.gserviceaccount.com",
"external_id": "<your-group-uuid>"
}
}

The endpoint is idempotent. external_id is a per-group identifier and can be reused across providers. It is used in both aws (for the IAM trust policy) and gcp (for the GCS bucket-control verification token).
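
As a rough illustration of how the response feeds the grant steps, the sketch below pulls out the values each provider needs. It assumes the response has already been parsed into a dict with the shape shown above; the function name and the output keys are made up for this example.

```python
def grant_targets(identities):
    """Collect the values each provider's grant step needs.

    `identities` is the parsed JSON body of GET /onboarding/identities.
    """
    aws = identities["aws"]
    azure = identities["azure"]
    gcp = identities["gcp"]
    return {
        # Principal for the AWS trust policy
        "aws_principal": f"arn:aws:iam::{aws['trust_account_id']}:root",
        # Value for the sts:ExternalId condition
        "aws_external_id": aws["external_id"],
        # App to instantiate as a service principal in your Azure tenant
        "azure_app_id": azure["app_id"],
        # Member that receives roles/iam.serviceAccountTokenCreator
        "gcp_impersonator": gcp["service_account_email"],
        # Contents of gs://<bucket>/.mathpix-verify
        "gcp_verification_id": gcp["external_id"],
    }
```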


Per-provider grant guides

You only need to set up the provider whose buckets you'll use as source_uri / destination_uri. Each provider's grant is independent.

AWS S3

How it works

Mathpix uses cross-account IAM role assumption to access your bucket. You create an IAM role in your AWS account that:

  • Has permission to read (and optionally write) the bucket — the role's permission policy.
  • Allows Mathpix's AWS account (426887012336) to call sts:AssumeRole on it, conditioned on a unique external ID — the role's trust policy.

At runtime, Mathpix calls AWS STS AssumeRole with your role ARN and the external ID; AWS returns temporary credentials Mathpix uses for the duration of the job.

The external ID is not a secret — it's a per-customer identifier that AWS uses to protect against the confused-deputy problem. Get yours from /onboarding/identities.

1. Create the IAM role

In the AWS Console (IAM → Roles → Create role) or via CLI / IaC, create a role with the two policies below.

Permission policy — what the role is allowed to do. Replace BUCKET_NAME with your bucket. Drop s3:PutObject if Mathpix only needs to read (no destination_uri write-back).

{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "AllowMathpixBucketAccess",
"Effect": "Allow",
"Action": [
"s3:GetObject",
"s3:PutObject",
"s3:ListBucket",
"s3:GetBucketLocation"
],
"Resource": [
"arn:aws:s3:::BUCKET_NAME",
"arn:aws:s3:::BUCKET_NAME/*"
]
}
]
}

Both resource forms are intentional: s3:ListBucket and s3:GetBucketLocation are bucket-level actions (need the bucket ARN without /*); s3:GetObject and s3:PutObject are object-level (need the /* suffix).
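
If you generate the policy programmatically, something like the following sketch may help. It reproduces the permission policy above and drops s3:PutObject for read-only use; the function name and the `allow_write` flag are illustrative choices, not anything Mathpix or AWS prescribes.

```python
import json


def s3_permission_policy(bucket, allow_write=True):
    """Build the permission policy shown above for one bucket."""
    actions = ["s3:GetObject"]
    if allow_write:
        # Only needed when the bucket is used as a destination_uri
        actions.append("s3:PutObject")
    actions += ["s3:ListBucket", "s3:GetBucketLocation"]
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowMathpixBucketAccess",
                "Effect": "Allow",
                "Action": actions,
                # Bucket ARN for bucket-level actions, /* for object-level
                "Resource": [
                    f"arn:aws:s3:::{bucket}",
                    f"arn:aws:s3:::{bucket}/*",
                ],
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(s3_permission_policy("BUCKET_NAME", allow_write=False), indent=2))
```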

Trust policy — who can assume the role. Replace EXTERNAL_ID with the value from /onboarding/identities. The Mathpix account ID 426887012336 is fixed across all customers.

{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "AllowMathpixAssumeRole",
"Effect": "Allow",
"Principal": {
"AWS": "arn:aws:iam::426887012336:root"
},
"Action": "sts:AssumeRole",
"Condition": {
"StringEquals": {
"sts:ExternalId": "EXTERNAL_ID"
}
}
}
]
}

The sts:ExternalId condition is required — without it, AWS rejects Mathpix's assume-role call.
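
Before registering, it can be useful to sanity-check a trust policy document locally. The sketch below is a simplified checker, not a full IAM policy evaluator: it only looks for an Allow statement naming the Mathpix account, the sts:AssumeRole action, and a StringEquals condition on sts:ExternalId. The function names are hypothetical.

```python
MATHPIX_PRINCIPAL = "arn:aws:iam::426887012336:root"


def _as_list(value):
    """IAM allows a single string where a list is expected."""
    return [value] if isinstance(value, str) else list(value)


def _statement_problems(stmt, external_id):
    problems = []
    if "sts:AssumeRole" not in _as_list(stmt.get("Action", [])):
        problems.append("statement does not allow sts:AssumeRole")
    condition = stmt.get("Condition", {}).get("StringEquals", {})
    configured = condition.get("sts:ExternalId")
    if configured is None:
        problems.append("missing sts:ExternalId condition")
    elif configured != external_id:
        problems.append("sts:ExternalId does not match /onboarding/identities")
    return problems


def trust_policy_problems(policy, external_id):
    """Return a list of problems; an empty list means the policy looks right."""
    for stmt in policy.get("Statement", []):
        principal = stmt.get("Principal")
        aws = principal.get("AWS", []) if isinstance(principal, dict) else []
        if stmt.get("Effect") == "Allow" and MATHPIX_PRINCIPAL in _as_list(aws):
            return _statement_problems(stmt, external_id)
    return ["no Allow statement names the Mathpix account as principal"]
```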

2. Note the role ARN

Note the role ARN (arn:aws:iam::<your-account>:role/<role-name>), bucket name, and region. You'll pass them on registration: the ARN as iam_role_arn in provider_specific_details, and bucket and region at the top level.

Azure Blob Storage

How it works

Mathpix authenticates against your Microsoft Entra ID tenant (formerly Azure AD) using a multi-tenant application registration — an app Mathpix owns that can be instantiated in any customer tenant as a service principal. You grant that service principal an Azure RBAC role on a storage account or container. At runtime Mathpix acquires a token from your tenant for our app and uses it to authorize requests against your storage.

Mathpix's application identity (also returned by /onboarding/identities):

| Field | Value |
| --- | --- |
| Application (client) ID | 580fe4b7-6dbf-4313-adf4-10ea1fda13c7 |
| Mathpix tenant ID | 748276dd-b27a-439b-81a1-58334f1af34c |

1. Instantiate Mathpix's service principal in your tenant

This creates an Enterprise Application record for Mathpix in your tenant — required before any role assignment can target Mathpix. Use the Azure CLI (the Portal's Enterprise Applications search won't find Mathpix until this step has run in your tenant):

# Sign in to your tenant as a user who can add enterprise applications
az login --tenant YOUR_TENANT_ID

# Create the service principal for Mathpix's app. Idempotent — re-running is safe.
az ad sp create --id 580fe4b7-6dbf-4313-adf4-10ea1fda13c7

After this runs, Mathpix appears under Microsoft Entra ID → Enterprise Applications in your tenant.

2. Assign Mathpix a Storage role

Choose the role based on whether Mathpix needs to write results back:

  • Storage Blob Data Reader — read-only. Sufficient when the container is only a source_uri.
  • Storage Blob Data Contributor — read + write. Required for destination_uri write-back.

Choose the scope — which storage resources Mathpix can access:

| Scope | Example |
| --- | --- |
| Single container | /subscriptions/SUB_ID/resourceGroups/RG/providers/Microsoft.Storage/storageAccounts/ACCOUNT/blobServices/default/containers/CONTAINER |
| Whole storage account | /subscriptions/SUB_ID/resourceGroups/RG/providers/Microsoft.Storage/storageAccounts/ACCOUNT |

az role assignment create \
--assignee 580fe4b7-6dbf-4313-adf4-10ea1fda13c7 \
--role "Storage Blob Data Contributor" \
--scope "/subscriptions/SUB_ID/resourceGroups/RG/providers/Microsoft.Storage/storageAccounts/ACCOUNT/blobServices/default/containers/CONTAINER"

RBAC assignments typically propagate within 30–60 seconds but can take up to 5 minutes.
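
The two scope forms differ only in the trailing container segment, which is easy to mistype. As a small sketch, a helper like the one below (the name `storage_scope` is made up for this example) can assemble the string passed to --scope:

```python
def storage_scope(subscription_id, resource_group, account, container=None):
    """Build an Azure RBAC scope for a storage account or one container."""
    scope = (
        f"/subscriptions/{subscription_id}"
        f"/resourceGroups/{resource_group}"
        f"/providers/Microsoft.Storage/storageAccounts/{account}"
    )
    if container is not None:
        # Narrow the scope from the whole account to a single container
        scope += f"/blobServices/default/containers/{container}"
    return scope
```

Scoping to a single container grants Mathpix the least access and is usually the safer default.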

3. Note your tenant and storage details

Note your tenant ID (az account show --query tenantId -o tsv) and storage account name (the ACCOUNT in https://ACCOUNT.blob.core.windows.net/) — you'll pass them as provider_specific_details.tenant_id and provider_specific_details.storage_account on registration.

Supported account types

Supported: StorageV2, BlobStorage, and BlockBlobStorage account kinds on commercial Azure with the public endpoint.

Not supported: FileStorage-only accounts; storage accounts behind a firewall or private endpoint without Mathpix's egress IPs allowlisted; sovereign clouds (Azure Government, Azure China).

If your storage account is behind a firewall, contact Mathpix support for the current set of egress IPs to allowlist.

Google Cloud Storage

GCS uses service-account impersonation: you create a service account in your project with the bucket access you need, then grant the Mathpix impersonator (ingest@mathpix-files.iam.gserviceaccount.com) roles/iam.serviceAccountTokenCreator on your service account. Mathpix mints short-lived tokens to act as your SA — no standing access.

1. Create a service account in your project, with the bucket access you need:

gcloud iam service-accounts create mathpix-reader \
--project=your-project \
--display-name="Mathpix Files API reader"

# Grant your SA read access on the bucket (use objectAdmin for read+write):
gsutil iam ch \
serviceAccount:mathpix-reader@your-project.iam.gserviceaccount.com:objectViewer \
gs://your-bucket

2. Grant the Mathpix impersonator tokenCreator on your service account:

gcloud iam service-accounts add-iam-policy-binding \
mathpix-reader@your-project.iam.gserviceaccount.com \
--member="serviceAccount:ingest@mathpix-files.iam.gserviceaccount.com" \
--role="roles/iam.serviceAccountTokenCreator"

3. Prove you control the bucket — write your verification id into it:

# Your verification id is the `gcp.external_id` value from /onboarding/identities.
printf '<your gcp.external_id from /onboarding/identities>' \
| gcloud storage cp - gs://your-bucket/.mathpix-verify

At registration, Mathpix impersonates your service account, reads gs://your-bucket/.mathpix-verify, and confirms its contents match your verification id. Because writing into a bucket requires write access, a match proves you control the bucket you're registering — this is GCS's confused-deputy protection (the analog of the AWS external_id in your trust policy). The object is only read once, at registration; you may delete it afterward (you'll need to recreate it if you ever re-register the bucket).
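
One easy way to get a mismatch is a stray trailing newline in the object (for example, from using echo instead of printf). The sketch below is a local pre-check only. It assumes the comparison is byte-for-byte exact, which is what the "no trailing characters" note in Troubleshooting suggests; the function names are illustrative.

```python
def verification_payload(external_id):
    """Bytes to upload as .mathpix-verify: the id only, no newline."""
    return external_id.strip().encode("utf-8")


def contents_match(object_bytes, external_id):
    """True if the downloaded object holds exactly the verification id."""
    return object_bytes == verification_payload(external_id)
```

For instance, `contents_match(b"abc\n", "abc")` is False because of the trailing newline.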

Note your project ID and target SA email — you'll pass them as provider_specific_details.gcp_project_id and provider_specific_details.target_sa_email on registration.


POST /files/v1/data-sources

POST api.mathpix.com/files/v1/data-sources

Register a bucket as a data source after completing the per-provider grant above. Returns a data_source_id.

Request

name string (optional)

Human-readable label (max 128 chars).

provider string

One of aws, azure, gcp.

bucket string

Bucket / container name (S3: your-bucket; Azure: the container; GCS: your-bucket).

region string (optional)

Required for AWS access_key auth; optional for iam_role (Mathpix discovers it via the bucket).

auth_method string

  • iam_role: AWS, cross-account role assumption (recommended).
  • access_key: AWS, legacy long-lived key in secret.
  • azure_ad: Azure, RBAC grant to Mathpix's multi-tenant app.
  • service_account: GCS, impersonation of a customer-owned SA.

provider_specific_details object

Provider-shaped non-secret metadata. See the per-provider shapes below.

secret string (optional)

Only for AWS access_key. Encrypted at rest. Azure and GCS reject any value here.

Per-provider provider_specific_details

{
"name": "prod-source",
"provider": "aws",
"bucket": "your-bucket",
"region": "us-east-1",
"auth_method": "iam_role",
"provider_specific_details": {
"iam_role_arn": "arn:aws:iam::123456789012:role/MathpixReader",
"aws_external_id": "<your external_id from /onboarding/identities>"
}
}

The aws_external_id here must equal the external_id returned by /onboarding/identities for your group.
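
Only the AWS body is shown above. As a hedged sketch, the Azure and GCS bodies can be assembled from the field names the grant guides mention (tenant_id and storage_account for Azure; gcp_project_id and target_sa_email for GCS). Whether these are the only accepted fields is an assumption, and the builder functions are hypothetical.

```python
def _base(provider, bucket, auth_method, details, name=None):
    body = {
        "provider": provider,
        "bucket": bucket,
        "auth_method": auth_method,
        "provider_specific_details": details,
    }
    if name is not None:
        body["name"] = name  # optional human-readable label
    return body


def azure_registration(container, tenant_id, storage_account, name=None):
    """Request body for an Azure container (keyless, so no secret field)."""
    details = {"tenant_id": tenant_id, "storage_account": storage_account}
    return _base("azure", container, "azure_ad", details, name)


def gcs_registration(bucket, gcp_project_id, target_sa_email, name=None):
    """Request body for a GCS bucket (keyless, so no secret field)."""
    details = {
        "gcp_project_id": gcp_project_id,
        "target_sa_email": target_sa_email,
    }
    return _base("gcp", bucket, "service_account", details, name)
```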

Response

// Response 200
{ "data_source_id": "7a3f1e90-2c4b-4d8a-9f01-5b6c7d8e9a01" }

Errors

| Code | HTTP | When it fires |
| --- | --- | --- |
| bad_request | 400 | Missing required field, invalid provider/auth_method combo, secret provided for Azure/GCS, or (AWS) an aws_external_id that doesn't match your group's ExternalId. |
| bad_request | 400 | GCS only. Bucket-control verification failed: gs://<bucket>/.mathpix-verify is missing, unreadable (the tokenCreator grant or bucket access isn't set up), or its contents don't match your verification id. See the GCS grant guide. |
| unavailable | 503 | GCS only. Mathpix couldn't reach the bucket to run verification; retry. |
| conflict | 409 | A data source for this (provider, bucket) already exists in your group. Returned with the existing data_source_id. |

For GCS, registration only succeeds once the verification object is in place, so a 200 already confirms the impersonation grant works. For AWS/Azure, call /test afterward to verify the grant end-to-end before sending real submissions.
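
A client might handle these outcomes roughly as sketched below. The sketch assumes the 409 body carries the existing id under the same data_source_id key as a 200 response, which the table implies but does not spell out. `RetryableError` and `resolve_data_source_id` are names invented for this example.

```python
class RetryableError(Exception):
    """Raised when the same registration request can simply be retried."""


def resolve_data_source_id(status, body):
    """Map a registration response (HTTP status, parsed body) to an id."""
    if status == 200:
        return body["data_source_id"]
    if status == 409:
        # Assumption: the conflict body exposes the existing id
        # under the "data_source_id" key.
        return body["data_source_id"]
    if status == 503:
        raise RetryableError("verification could not reach the bucket")
    # 400 and anything unexpected: surface the body for debugging
    raise ValueError(f"registration rejected with HTTP {status}: {body}")
```

Treating 409 as success makes registration safe to re-run in setup scripts.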


POST /files/v1/data-sources/{id}/test

POST api.mathpix.com/files/v1/data-sources/{id}/test

Verify Mathpix can reach the bucket using the registered credentials. Performs a read probe (and, where applicable, a write probe). Best-effort — failures don't mutate the data source row.

Example

curl -X POST -H 'app_key: APP_KEY' \
https://api.mathpix.com/files/v1/data-sources/7a3f1e90-2c4b-4d8a-9f01-5b6c7d8e9a01/test
// Response 200 — both checks passed
{
"result": "ok",
"checks": { "read": true, "write": true },
"message": "OK"
}
// Response 200 — read works, write doesn't
{
"result": "failed",
"checks": { "read": true, "write": false },
"message": "Write probe failed: 403 AccessDenied (missing s3:PutObject)"
}
// Response 200 — grant not reachable
{
"result": "failed",
"checks": { "read": false },
"message": "AssumeRole failed: ExternalId mismatch"
}

Use this any time customer-side IAM changes might have affected access; /test is the canonical way to confirm Mathpix can still reach your bucket.
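
Note that all three example responses are HTTP 200, so the outcome has to be read from the body. The sketch below shows one way to interpret it. It assumes a source_uri-only workflow needs just the read check; the function name `usable_for` is illustrative.

```python
def usable_for(response, need_write):
    """Decide whether a /test response is good enough for the intended use.

    `response` is the parsed JSON body; `need_write` is True when the
    bucket will be used as a destination_uri.
    """
    checks = response.get("checks", {})
    if not checks.get("read", False):
        return False
    if need_write and not checks.get("write", False):
        return False
    return True
```

With the second example response above, `usable_for(resp, need_write=False)` is True even though result is "failed".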


GET /files/v1/data-sources

GET api.mathpix.com/files/v1/data-sources

List the data sources registered for your group. Secrets (for aws/access_key) are never returned.

Example

curl -H 'app_key: APP_KEY' \
https://api.mathpix.com/files/v1/data-sources
// Response 200 — a flat summary per source (no provider_specific_details, no secrets)
{
"data_sources": [
{
"data_source_id": "7a3f1e90-2c4b-4d8a-9f01-5b6c7d8e9a01",
"name": "prod-source",
"provider": "aws",
"bucket": "your-bucket",
"region": "us-east-1",
"auth_method": "iam_role",
"created_at": "2026-05-29T12:00:00Z"
},
{
"data_source_id": "9b2e4c71-8d3a-4f15-b6c2-1e0d9a8c7b34",
"name": "prod-gcs",
"provider": "gcp",
"bucket": "your-bucket",
"region": null,
"auth_method": "service_account",
"created_at": "2026-05-29T12:05:00Z"
}
]
}
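
Because a group can hold at most one data source per (provider, bucket) pair, the listing can be turned into a lookup table. A minimal sketch, assuming the response shape shown above (the function name is made up):

```python
def index_by_bucket(listing):
    """Map (provider, bucket) to data_source_id for a list response."""
    return {
        (ds["provider"], ds["bucket"]): ds["data_source_id"]
        for ds in listing["data_sources"]
    }
```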

DELETE /files/v1/data-sources/{id}

DELETE api.mathpix.com/files/v1/data-sources/{id}

Permanently remove a data source. Subsequent submissions that reference the bucket return 404 data_source_not_found.

curl -X DELETE -H 'app_key: APP_KEY' \
https://api.mathpix.com/files/v1/data-sources/7a3f1e90-2c4b-4d8a-9f01-5b6c7d8e9a01
// Response 200
{
"data_source_id": "7a3f1e90-2c4b-4d8a-9f01-5b6c7d8e9a01",
"status": "deleted"
}

Note: In-flight jobs that already started against this data source continue to completion using their cached credentials. Only new submissions are affected.

To fully revoke access, also remove the cloud-side grant:

  • AWS: delete the IAM role (or remove the trust-policy statement) in your account.
  • Azure: az role assignment delete --assignee 580fe4b7-6dbf-4313-adf4-10ea1fda13c7 --scope "<your scope>". Optionally also delete the Mathpix Enterprise Application (Microsoft Entra ID → Enterprise Applications → Mathpix → Properties → Delete).
  • GCS: gcloud iam service-accounts remove-iam-policy-binding <your-sa> --member="serviceAccount:ingest@mathpix-files.iam.gserviceaccount.com" --role="roles/iam.serviceAccountTokenCreator".

Errors

| Code | HTTP | When it fires |
| --- | --- | --- |
| not_found | 404 | No data source with that id (or already deleted). |
| forbidden | 403 | Data source belongs to a different group. |

Using a registered bucket

Once a data source is registered and /test returns ok, reference the bucket directly via source_uri or destination_uri:

# Read from your bucket; write results back to it
curl -X POST https://api.mathpix.com/files/v1/uri \
-H 'app_key: APP_KEY' \
-H 'Content-Type: application/json' \
--data '{
"source_uri": "s3://your-bucket/contracts/contract-001.pdf",
"destination_uri": "s3://your-bucket/processed/",
"conversion_formats": { "docx": true, "md": true }
}'

The same source_uri / destination_uri shape works on POST /files/v1/jobs for batches.
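
Before submitting a batch, you may want to confirm that every bucket in it is registered. The sketch below derives a (provider, bucket) pair from each URI and reports the missing ones. The Azure form assumed here is https://ACCOUNT.blob.core.windows.net/CONTAINER/..., and matching Azure sources by container name alone is a simplification; both function names are hypothetical.

```python
from urllib.parse import urlparse


def bucket_key(uri):
    """Return (provider, bucket) for an s3://, gs://, or Azure Blob URL."""
    parsed = urlparse(uri)
    if parsed.scheme == "s3":
        return ("aws", parsed.netloc)
    if parsed.scheme == "gs":
        return ("gcp", parsed.netloc)
    if parsed.scheme == "https" and parsed.netloc.endswith(".blob.core.windows.net"):
        # The first path segment is the container name
        container = parsed.path.lstrip("/").split("/", 1)[0]
        return ("azure", container)
    raise ValueError(f"unsupported URI: {uri}")


def unregistered_buckets(uris, registered):
    """Return the (provider, bucket) pairs in `uris` missing from `registered`."""
    return sorted({bucket_key(u) for u in uris} - set(registered))
```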


Troubleshooting

data_source_access_denied on submission

The data source is registered but Mathpix can't reach the bucket. Run /test for the specific failure message. Common causes by provider:

AWS

  • Trust policy not yet propagated — wait 30–60 seconds and retry.
  • ExternalId mismatch — verify the value in your trust policy against /onboarding/identities.
  • Mathpix can list the bucket but not read objects → permission policy missing s3:GetObject, or the object-level statement is missing the /* resource suffix.
  • Mathpix can read but writes fail → permission policy missing s3:PutObject, or the destination_uri points at a different bucket (each bucket needs its own data source).

Azure

  • 403 immediately after granting RBAC → assignment hasn't propagated; wait 1–5 minutes.

  • 403 after 10+ minutes → assignment landed on the wrong scope or the wrong principal. Verify both:

    # Confirm the service principal exists in your tenant
    az ad sp show --id 580fe4b7-6dbf-4313-adf4-10ea1fda13c7 --query '{appId:appId, displayName:displayName}'

    # List Mathpix's role assignments
    az role assignment list --assignee 580fe4b7-6dbf-4313-adf4-10ea1fda13c7 --all -o table

    If az role assignment list returns no rows for your storage scope, the assignment didn't land where you intended.

  • Mathpix can list the container but can't write → you assigned Storage Blob Data Reader instead of Storage Blob Data Contributor.

GCS

  • roles/iam.serviceAccountTokenCreator not yet granted, granted to the wrong principal, or still propagating.
  • Your target SA itself doesn't have read/write (objectViewer / objectAdmin) on the bucket.

data_source_not_found on submission

The bucket in your source_uri has no data source registered for your group. Either correct the bucket name in the URI, or register the bucket with POST /files/v1/data-sources.

409 conflict on registration

A data source for that (provider, bucket) already exists in your group. The response includes the existing data_source_id — use that one, or DELETE the existing one first if you want to re-register with different settings.

GCS registration fails with bad_request (bucket-control verification)

GCS registration reads gs://<bucket>/.mathpix-verify and checks its contents against your verification id. The bad_request message tells you which check failed:

  • "not found" → you haven't written the verification object yet. See step 3 of the GCS grant guide.
  • "could not read" → the tokenCreator grant on your service account, or that SA's read access to the bucket, isn't in place (or is still propagating — wait 30–60s).
  • "does not match" → the object's contents aren't your current verification id. Re-fetch it from /onboarding/identities (the external_id value) and overwrite the object exactly, with no trailing characters.

secret rejected with 400 bad_request for Azure / GCS

Azure (azure_ad) and GCS (service_account) are keyless — they never accept a secret field. Only AWS access_key (legacy fallback) does. Use iam_role for AWS instead.

