Data Sources API
A data source is a registered pointer from your Mathpix account to a bucket or container you own, with an attached access grant. Once registered, you can pass s3://, gs://, or Azure Blob URLs as source_uri (read) or destination_uri (write) on POST /files/v1/uri and POST /files/v1/jobs.
The grant model is keyless by default — you grant a Mathpix identity scoped access via IAM role, AD app, or service-account impersonation. No secrets are uploaded to Mathpix; we obtain short-lived per-customer credentials on demand. (AWS supports a legacy access_key fallback; Azure and GCS do not.)
| Endpoint | Description |
|---|---|
| GET /files/v1/onboarding/identities | Get the Mathpix identities you grant access to (call first) |
| POST /files/v1/data-sources | Register a bucket / container |
| POST /files/v1/data-sources/{id}/test | Verify read (and write) access |
| GET /files/v1/data-sources | List the group's registered data sources |
| DELETE /files/v1/data-sources/{id} | Remove a data source |
Setup at a glance
1. GET /files/v1/onboarding/identities: fetch the Mathpix identities (AWS account, Azure app, GCS service account) you'll grant access to, plus the per-group ExternalId (used in the AWS trust policy and as the GCS bucket-control verification id).
2. Grant Mathpix access on the cloud side (AWS, Azure, GCS).
3. POST /files/v1/data-sources: register the bucket, providing the non-secret metadata (role ARN, tenant ID, etc.).
4. POST /files/v1/data-sources/{id}/test: verify the grant works.
5. Reference the bucket via source_uri / destination_uri on submissions.
GET /files/v1/onboarding/identities
GET api.mathpix.com/files/v1/onboarding/identities
Returns the Mathpix identities customers grant access to, plus the per-group external_id used in the AWS trust policy and as the GCS bucket-control verification id. Call this before setting up grants: the external_id is generated on first call and is immutable thereafter.
Example
curl -H 'app_key: APP_KEY' \
https://api.mathpix.com/files/v1/onboarding/identities
// Response 200
{
"aws": {
"trust_account_id": "426887012336",
"external_id": "<your-group-uuid>"
},
"azure": {
"app_id": "580fe4b7-6dbf-4313-adf4-10ea1fda13c7",
"tenant_id": "748276dd-b27a-439b-81a1-58334f1af34c"
},
"gcp": {
"service_account_email": "ingest@mathpix-files.iam.gserviceaccount.com",
"external_id": "<your-group-uuid>"
}
}
The endpoint is idempotent. external_id is a per-group identifier shared across providers: it appears under both aws (for the IAM trust policy) and gcp (for the GCS bucket-control verification token).
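As a rough illustration of consuming this response, the Python sketch below pulls out the values each grant guide needs. The `MathpixIdentities` dataclass and `parse_identities` helper are hypothetical names, not part of any Mathpix SDK; the field names are taken from the example response above, and the equality check between the two `external_id` values is an assumption based on the per-group description.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class MathpixIdentities:
    """Values a customer needs when setting up cloud-side grants."""
    aws_trust_account_id: str
    azure_app_id: str
    azure_tenant_id: str
    gcp_service_account_email: str
    external_id: str


def parse_identities(body: str) -> MathpixIdentities:
    """Extract grant inputs from a /onboarding/identities response body."""
    data = json.loads(body)
    aws_id = data["aws"]["external_id"]
    gcp_id = data["gcp"]["external_id"]
    # Assumption: the same per-group id is returned under both providers.
    if aws_id != gcp_id:
        raise ValueError("aws and gcp external_id values differ")
    return MathpixIdentities(
        aws_trust_account_id=data["aws"]["trust_account_id"],
        azure_app_id=data["azure"]["app_id"],
        azure_tenant_id=data["azure"]["tenant_id"],
        gcp_service_account_email=data["gcp"]["service_account_email"],
        external_id=aws_id,
    )
```

The parsed `external_id` can then be pasted into the AWS trust policy or written to the GCS verification object.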
Per-provider grant guides
You only need to set up the provider whose buckets you'll use as source_uri / destination_uri. Each provider's grant is independent.
AWS S3
How it works
Mathpix uses cross-account IAM role assumption to access your bucket. You create an IAM role in your AWS account that:
- Has permission to read (and optionally write) the bucket. This is the role's permission policy.
- Allows Mathpix's AWS account (426887012336) to call sts:AssumeRole on it, conditioned on a unique external ID. This is the role's trust policy.
At runtime, Mathpix calls AWS STS AssumeRole with your role ARN and the external ID; AWS returns temporary credentials Mathpix uses for the duration of the job.
The external ID is not a secret. It's a per-customer identifier that AWS uses to protect against the confused-deputy problem. Get yours from /onboarding/identities.
1. Create the IAM role
In the AWS Console (IAM → Roles → Create role) or via CLI / IaC, create a role with the two policies below.
Permission policy — what the role is allowed to do. Replace BUCKET_NAME with your bucket. Drop s3:PutObject if Mathpix only needs to read (no destination_uri write-back).
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "AllowMathpixBucketAccess",
"Effect": "Allow",
"Action": [
"s3:GetObject",
"s3:PutObject",
"s3:ListBucket",
"s3:GetBucketLocation"
],
"Resource": [
"arn:aws:s3:::BUCKET_NAME",
"arn:aws:s3:::BUCKET_NAME/*"
]
}
]
}
Both resource forms are intentional: s3:ListBucket and s3:GetBucketLocation are bucket-level actions (need the bucket ARN without /*); s3:GetObject and s3:PutObject are object-level (need the /* suffix).
Trust policy — who can assume the role. Replace EXTERNAL_ID with the value from /onboarding/identities. The Mathpix account ID 426887012336 is fixed across all customers.
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "AllowMathpixAssumeRole",
"Effect": "Allow",
"Principal": {
"AWS": "arn:aws:iam::426887012336:root"
},
"Action": "sts:AssumeRole",
"Condition": {
"StringEquals": {
"sts:ExternalId": "EXTERNAL_ID"
}
}
}
]
}
The sts:ExternalId condition is required — without it, AWS rejects Mathpix's assume-role call.
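If you manage IAM through scripts rather than the console, the two policies above can be generated instead of hand-edited. The following Python sketch is one way to do that; `permission_policy` and `trust_policy` are hypothetical helper names, and the output simply mirrors the JSON documents shown above.

```python
import json


def permission_policy(bucket: str, allow_write: bool = True) -> dict:
    """Build the role's permission policy; omit s3:PutObject for read-only use."""
    actions = ["s3:GetObject", "s3:ListBucket", "s3:GetBucketLocation"]
    if allow_write:
        actions.insert(1, "s3:PutObject")
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "AllowMathpixBucketAccess",
            "Effect": "Allow",
            "Action": actions,
            # Bucket ARN for bucket-level actions, /* for object-level actions.
            "Resource": [f"arn:aws:s3:::{bucket}", f"arn:aws:s3:::{bucket}/*"],
        }],
    }


def trust_policy(external_id: str, mathpix_account_id: str = "426887012336") -> dict:
    """Build the role's trust policy, conditioned on the per-group external ID."""
    if not external_id:
        raise ValueError("external_id is required")
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "AllowMathpixAssumeRole",
            "Effect": "Allow",
            "Principal": {"AWS": f"arn:aws:iam::{mathpix_account_id}:root"},
            "Action": "sts:AssumeRole",
            "Condition": {"StringEquals": {"sts:ExternalId": external_id}},
        }],
    }


def to_json(policy: dict) -> str:
    """Serialize a policy for pasting into the console or an IaC template."""
    return json.dumps(policy, indent=2)
```

Calling `to_json(permission_policy("your-bucket", allow_write=False))` yields the read-only variant for buckets used only as a source_uri.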
2. Note the role ARN
Note the role ARN (arn:aws:iam::<your-account>:role/<role-name>), bucket name, and region. You'll pass them on registration (iam_role_arn in provider_specific_details, plus bucket and region at the top level).
Azure Blob Storage
How it works
Mathpix authenticates against your Microsoft Entra ID tenant (formerly Azure AD) using a multi-tenant application registration — an app Mathpix owns that can be instantiated in any customer tenant as a service principal. You grant that service principal an Azure RBAC role on a storage account or container. At runtime Mathpix acquires a token from your tenant for our app and uses it to authorize requests against your storage.
Mathpix's application identity (also returned by /onboarding/identities):
| Field | Value |
|---|---|
| Application (client) ID | 580fe4b7-6dbf-4313-adf4-10ea1fda13c7 |
| Mathpix tenant ID | 748276dd-b27a-439b-81a1-58334f1af34c |
1. Instantiate Mathpix's service principal in your tenant
This creates an Enterprise Application record for Mathpix in your tenant — required before any role assignment can target Mathpix. Use the Azure CLI (the Portal's Enterprise Applications search won't find Mathpix until this step has run in your tenant):
# Sign in to your tenant as a user who can add enterprise applications
az login --tenant YOUR_TENANT_ID
# Create the service principal for Mathpix's app. Idempotent — re-running is safe.
az ad sp create --id 580fe4b7-6dbf-4313-adf4-10ea1fda13c7
After this runs, Mathpix appears under Microsoft Entra ID → Enterprise Applications in your tenant.
2. Assign Mathpix a Storage role
Choose the role based on whether Mathpix needs to write results back:
- Storage Blob Data Reader: read-only. Sufficient when the container is only a source_uri.
- Storage Blob Data Contributor: read + write. Required for destination_uri write-back.
Choose the scope — which storage resources Mathpix can access:
| Scope | Example |
|---|---|
| Single container | /subscriptions/SUB_ID/resourceGroups/RG/providers/Microsoft.Storage/storageAccounts/ACCOUNT/blobServices/default/containers/CONTAINER |
| Whole storage account | /subscriptions/SUB_ID/resourceGroups/RG/providers/Microsoft.Storage/storageAccounts/ACCOUNT |
az role assignment create \
--assignee 580fe4b7-6dbf-4313-adf4-10ea1fda13c7 \
--role "Storage Blob Data Contributor" \
--scope "/subscriptions/SUB_ID/resourceGroups/RG/providers/Microsoft.Storage/storageAccounts/ACCOUNT/blobServices/default/containers/CONTAINER"
RBAC assignments propagate in 30–60 seconds typically, up to 5 minutes.
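The scope strings in the table above follow a fixed pattern, so they can be assembled programmatically. The Python sketch below shows one way; `storage_scope` and `role_for` are hypothetical helpers, and they only build the arguments you would pass to `az role assignment create`, without calling Azure.

```python
from typing import Optional


def storage_scope(subscription_id: str, resource_group: str,
                  account: str, container: Optional[str] = None) -> str:
    """Return the RBAC scope for a whole storage account or a single container."""
    scope = (
        f"/subscriptions/{subscription_id}"
        f"/resourceGroups/{resource_group}"
        f"/providers/Microsoft.Storage/storageAccounts/{account}"
    )
    if container is not None:
        # Narrow the grant to one container instead of the whole account.
        scope += f"/blobServices/default/containers/{container}"
    return scope


def role_for(needs_write: bool) -> str:
    """Pick the role: Contributor for destination_uri write-back, Reader otherwise."""
    if needs_write:
        return "Storage Blob Data Contributor"
    return "Storage Blob Data Reader"
```

Scoping to a single container is the narrower grant; the account-level scope covers every container in the account.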
3. Note your tenant and storage details
Note your tenant ID (az account show --query tenantId -o tsv) and storage account name (the ACCOUNT in https://ACCOUNT.blob.core.windows.net/). You'll pass them as provider_specific_details.azure_tenant_id and provider_specific_details.storage_account on registration.
Supported account types
| Status | Account types |
|---|---|
| Supported | StorageV2, BlobStorage, BlockBlobStorage account kinds on commercial Azure with the public endpoint |
| Not supported | FileStorage-only accounts; storage accounts behind a firewall or private endpoint without Mathpix's egress IPs allowlisted; sovereign clouds (Azure Government, Azure China) |
If your storage account is behind a firewall, contact Mathpix support for the current set of egress IPs to allowlist.
Google Cloud Storage
GCS uses service-account impersonation: you create a service account in your project with the bucket access you need, then grant the Mathpix impersonator (ingest@mathpix-files.iam.gserviceaccount.com) roles/iam.serviceAccountTokenCreator on your service account. Mathpix mints short-lived tokens to act as your SA — no standing access.
1. Create a service account in your project, with the bucket access you need:
gcloud iam service-accounts create mathpix-reader \
--project=your-project \
--display-name="Mathpix Files API reader"
# Grant your SA read access on the bucket (use objectAdmin for read+write):
gsutil iam ch \
serviceAccount:mathpix-reader@your-project.iam.gserviceaccount.com:objectViewer \
gs://your-bucket
2. Grant the Mathpix impersonator tokenCreator on your service account:
gcloud iam service-accounts add-iam-policy-binding \
mathpix-reader@your-project.iam.gserviceaccount.com \
--member="serviceAccount:ingest@mathpix-files.iam.gserviceaccount.com" \
--role="roles/iam.serviceAccountTokenCreator"
3. Prove you control the bucket — write your verification id into it:
# Your verification id is the `gcp.external_id` value from /onboarding/identities.
printf '<your gcp.external_id from /onboarding/identities>' \
| gcloud storage cp - gs://your-bucket/.mathpix-verify
At registration, Mathpix impersonates your service account, reads gs://your-bucket/.mathpix-verify, and confirms its contents match your verification id. Because writing into a bucket requires write access, a match proves you control the bucket you're registering — this is GCS's confused-deputy protection (the analog of the AWS external_id in your trust policy). The object is only read once, at registration; you may delete it afterward (you'll need to recreate it if you ever re-register the bucket).
Note your project ID and target SA email — you'll pass them as provider_specific_details.gcp_project_id and provider_specific_details.target_sa_email on registration.
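A stray newline or space in the verification object is an easy mistake to make. The Python sketch below prepares the exact bytes to upload and lets you check a downloaded copy locally before registering. Both helper names are hypothetical, and the byte-for-byte comparison is an assumption about how the match is evaluated, not a documented algorithm.

```python
def verification_payload(external_id: str) -> bytes:
    """Return the bytes to write to .mathpix-verify, with no surrounding whitespace."""
    cleaned = external_id.strip()
    if not cleaned:
        raise ValueError("external_id is empty")
    return cleaned.encode("utf-8")


def matches_verification_id(object_bytes: bytes, external_id: str) -> bool:
    """Local pre-check: assumed exact comparison, so a trailing newline fails."""
    return object_bytes == external_id.encode("utf-8")
```

For example, an object created with `echo` instead of `printf` would carry a trailing newline and fail this local check.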
POST /files/v1/data-sources
POST api.mathpix.com/files/v1/data-sources
Register a bucket as a data source after completing the per-provider grant above. Returns a data_source_id.
Request
name Human-readable label (max 128 chars).
provider One of aws, azure, gcp.
bucket Bucket / container name (S3: your-bucket; Azure: the container; GCS: your-bucket).
region Required for AWS access_key auth; optional for iam_role (Mathpix discovers it via the bucket).
auth_method One of:
- iam_role: AWS, cross-account role assumption (recommended).
- access_key: AWS, legacy long-lived key in secret.
- azure_ad: Azure, RBAC grant to Mathpix's multi-tenant app.
- service_account: GCS, impersonation of a customer-owned SA.
provider_specific_details Provider-shaped non-secret metadata. See the per-provider shapes below.
secret Only for AWS access_key. Encrypted at rest. Azure and GCS reject any value here.
Per-provider provider_specific_details
The three example request bodies below are, in order: AWS (iam_role), Azure (azure_ad), and GCS (service_account).
{
"name": "prod-source",
"provider": "aws",
"bucket": "your-bucket",
"region": "us-east-1",
"auth_method": "iam_role",
"provider_specific_details": {
"iam_role_arn": "arn:aws:iam::123456789012:role/MathpixReader",
"aws_external_id": "<your external_id from /onboarding/identities>"
}
}
The aws_external_id here must equal the external_id returned by /onboarding/identities for your group.
{
"name": "prod-azure",
"provider": "azure",
"bucket": "your-container",
"auth_method": "azure_ad",
"provider_specific_details": {
"azure_tenant_id": "<your-tenant-uuid>",
"storage_account": "youraccount"
}
}
{
"name": "prod-gcs",
"provider": "gcp",
"bucket": "your-bucket",
"auth_method": "service_account",
"provider_specific_details": {
"gcp_project_id": "your-project",
"target_sa_email": "mathpix-reader@your-project.iam.gserviceaccount.com"
}
}
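The field rules above (valid provider and auth_method pairs, secret only for AWS access_key, region required for access_key) can be checked client-side before sending the request. The Python sketch below is a hypothetical helper, not part of a Mathpix SDK; the server remains the authority on validation, and the contents of provider_specific_details are passed through unchecked.

```python
from typing import Optional

# Provider to allowed auth_method values, as listed in the request fields.
ALLOWED_AUTH = {
    "aws": {"iam_role", "access_key"},
    "azure": {"azure_ad"},
    "gcp": {"service_account"},
}


def build_registration(name: str, provider: str, bucket: str, auth_method: str,
                       details: dict, region: Optional[str] = None,
                       secret: Optional[str] = None) -> dict:
    """Assemble a POST /files/v1/data-sources body, rejecting obvious mistakes."""
    if len(name) > 128:
        raise ValueError("name exceeds 128 characters")
    if provider not in ALLOWED_AUTH:
        raise ValueError(f"unknown provider: {provider}")
    if auth_method not in ALLOWED_AUTH[provider]:
        raise ValueError(f"{auth_method} is not valid for {provider}")
    if secret is not None and auth_method != "access_key":
        raise ValueError("secret is only accepted for AWS access_key")
    if auth_method == "access_key" and region is None:
        raise ValueError("region is required for AWS access_key")
    body = {
        "name": name,
        "provider": provider,
        "bucket": bucket,
        "auth_method": auth_method,
        "provider_specific_details": dict(details),
    }
    # Optional fields are included only when supplied.
    if region is not None:
        body["region"] = region
    if secret is not None:
        body["secret"] = secret
    return body
```

Catching these cases locally avoids a round trip that would end in a 400 bad_request.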
Response
// Response 200
{ "data_source_id": "7a3f1e90-2c4b-4d8a-9f01-5b6c7d8e9a01" }
Errors
| Code | HTTP | When it fires |
|---|---|---|
| bad_request | 400 | Missing required field, invalid provider/auth_method combo, secret provided for Azure/GCS, or (AWS) an aws_external_id that doesn't match your group's ExternalId. |
| bad_request | 400 | GCS only: bucket-control verification failed because gs://<bucket>/.mathpix-verify is missing, unreadable (the tokenCreator grant or bucket access isn't set up), or its contents don't match your verification id. See the GCS grant guide. |
| unavailable | 503 | GCS only: Mathpix couldn't reach the bucket to run verification; retry. |
| conflict | 409 | A data source for this (provider, bucket) already exists in your group. Returned with the existing data_source_id. |
For GCS, registration only succeeds once the verification object is in place, so a 200 already confirms the impersonation grant works. For AWS/Azure, call /test afterward to verify the grant end-to-end before sending real submissions.
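One way a client might act on these status codes is sketched below in Python. The exception classes and `handle_registration` are hypothetical, and two details are assumptions rather than documented behavior: that the 409 body carries the existing id under the key `data_source_id`, and that error bodies may include a `message` field.

```python
class RetryableError(Exception):
    """The request may succeed if sent again later (503 unavailable)."""


class RegistrationError(Exception):
    """The request must be corrected before retrying (400 and other errors)."""


def handle_registration(status: int, body: dict) -> str:
    """Return a usable data_source_id, or raise describing what to do next."""
    if status == 200:
        return body["data_source_id"]
    if status == 409:
        # Assumption: the existing id is returned under the same key as on success.
        return body["data_source_id"]
    if status == 503:
        raise RetryableError("verification could not reach the bucket; retry")
    # Assumption: error bodies may carry a human-readable "message".
    raise RegistrationError(body.get("message", f"registration failed with HTTP {status}"))
```

Treating 409 as a success keeps registration scripts safe to re-run against an already registered bucket.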
POST /files/v1/data-sources/{id}/test
POST api.mathpix.com/files/v1/data-sources/{id}/test
Verify Mathpix can reach the bucket using the registered credentials. Performs a read probe (and, where applicable, a write probe). Best-effort — failures don't mutate the data source row.
Example
curl -X POST -H 'app_key: APP_KEY' \
https://api.mathpix.com/files/v1/data-sources/7a3f1e90-2c4b-4d8a-9f01-5b6c7d8e9a01/test
// Response 200 — both checks passed
{
"result": "ok",
"checks": { "read": true, "write": true },
"message": "OK"
}
// Response 200 — read works, write doesn't
{
"result": "failed",
"checks": { "read": true, "write": false },
"message": "Write probe failed: 403 AccessDenied (missing s3:PutObject)"
}
// Response 200 — grant not reachable
{
"result": "failed",
"checks": { "read": false },
"message": "AssumeRole failed: ExternalId mismatch"
}
Use this anytime customer-side IAM changes might have affected access — /test is the canonical way to confirm Mathpix can still reach your bucket.
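Because /test returns HTTP 200 even when a probe fails, callers need to inspect the body rather than the status code. The Python sketch below interprets the three example responses above; `failed_checks` and `is_usable` are hypothetical helpers, and treating a missing `write` entry as "not confirmed" is an assumption.

```python
from typing import Dict, List


def failed_checks(result: Dict) -> List[str]:
    """Names of probes that ran and reported False, in sorted order."""
    checks = result.get("checks", {})
    return sorted(name for name, passed in checks.items() if passed is False)


def is_usable(result: Dict, need_write: bool) -> bool:
    """True if the data source supports the intended use.

    Read access is always required; write access only when the bucket
    will be used as a destination_uri.
    """
    checks = result.get("checks", {})
    if checks.get("read") is not True:
        return False
    # Assumption: an absent "write" entry means write was not confirmed.
    return (not need_write) or checks.get("write") is True
```

With the second example response, the bucket is still usable as a source_uri even though `result` is `failed`.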
GET /files/v1/data-sources
GET api.mathpix.com/files/v1/data-sources
List the data sources registered for your group. Secrets (for aws/access_key) are never returned.
Example
curl -H 'app_key: APP_KEY' \
https://api.mathpix.com/files/v1/data-sources
// Response 200 — a flat summary per source (no provider_specific_details, no secrets)
{
"data_sources": [
{
"data_source_id": "7a3f1e90-2c4b-4d8a-9f01-5b6c7d8e9a01",
"name": "prod-source",
"provider": "aws",
"bucket": "your-bucket",
"region": "us-east-1",
"auth_method": "iam_role",
"created_at": "2026-05-29T12:00:00Z"
},
{
"data_source_id": "9b2e4c71-8d3a-4f15-b6c2-1e0d9a8c7b34",
"name": "prod-gcs",
"provider": "gcp",
"bucket": "your-bucket",
"region": null,
"auth_method": "service_account",
"created_at": "2026-05-29T12:05:00Z"
}
]
}
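A common use of this listing is to look up whether a bucket is already registered before attempting registration or submission. The Python sketch below works on the parsed response shown above; `find_data_source` is a hypothetical helper name.

```python
from typing import Dict, Optional


def find_data_source(listing: Dict, provider: str, bucket: str) -> Optional[Dict]:
    """Return the summary for (provider, bucket), or None if not registered."""
    for source in listing.get("data_sources", []):
        if source.get("provider") == provider and source.get("bucket") == bucket:
            return source
    return None
```

Matching on the (provider, bucket) pair mirrors the uniqueness rule that produces the 409 conflict on registration.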
DELETE /files/v1/data-sources/{id}
DELETE api.mathpix.com/files/v1/data-sources/{id}
Permanently remove a data source. Subsequent submissions that reference the bucket return 404 data_source_not_found.
curl -X DELETE -H 'app_key: APP_KEY' \
https://api.mathpix.com/files/v1/data-sources/7a3f1e90-2c4b-4d8a-9f01-5b6c7d8e9a01
// Response 200
{
"data_source_id": "7a3f1e90-2c4b-4d8a-9f01-5b6c7d8e9a01",
"status": "deleted"
}
Note: In-flight jobs that already started against this data source continue to completion using their cached credentials. Only new submissions are affected.
To fully revoke access, also remove the cloud-side grant:
- AWS: delete the IAM role (or remove the trust-policy statement) in your account.
- Azure: az role assignment delete --assignee 580fe4b7-6dbf-4313-adf4-10ea1fda13c7 --scope "<your scope>". Optionally also delete the Mathpix Enterprise Application (Microsoft Entra ID → Enterprise Applications → Mathpix → Properties → Delete).
- GCS: gcloud iam service-accounts remove-iam-policy-binding <your-sa> --member="serviceAccount:ingest@mathpix-files.iam.gserviceaccount.com" --role="roles/iam.serviceAccountTokenCreator".
Errors
| Code | HTTP | When it fires |
|---|---|---|
| not_found | 404 | No data source with that id (or already deleted). |
| forbidden | 403 | Data source belongs to a different group. |
Using a registered bucket
Once a data source is registered and /test returns ok, reference the bucket directly via source_uri or destination_uri:
# Read from your bucket; write results back to it
curl -X POST https://api.mathpix.com/files/v1/uri \
-H 'app_key: APP_KEY' \
-H 'Content-Type: application/json' \
--data '{
"source_uri": "s3://your-bucket/contracts/contract-001.pdf",
"destination_uri": "s3://your-bucket/processed/",
"conversion_formats": { "docx": true, "md": true }
}'
The same source_uri / destination_uri shape works on POST /files/v1/jobs for batches.
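Since each bucket needs its own data source, it can help to check a submission's URIs against your registered buckets before sending it. The Python sketch below covers only the s3:// and gs:// schemes; Azure Blob URLs are left out because their exact form is not spelled out here. `bucket_ref` and `unregistered_buckets` are hypothetical helper names.

```python
from typing import Dict, List, Set, Tuple
from urllib.parse import urlparse

# URI scheme to the provider value used at registration.
SCHEME_TO_PROVIDER = {"s3": "aws", "gs": "gcp"}


def bucket_ref(uri: str) -> Tuple[str, str]:
    """Split an s3:// or gs:// URI into (provider, bucket)."""
    parsed = urlparse(uri)
    provider = SCHEME_TO_PROVIDER.get(parsed.scheme)
    if provider is None or not parsed.netloc:
        raise ValueError(f"unsupported or malformed storage URI: {uri!r}")
    return provider, parsed.netloc


def unregistered_buckets(submission: Dict,
                         registered: Set[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """List (provider, bucket) pairs in the submission with no data source."""
    missing: List[Tuple[str, str]] = []
    for field in ("source_uri", "destination_uri"):
        uri = submission.get(field)
        if uri is None:
            continue
        ref = bucket_ref(uri)
        if ref not in registered and ref not in missing:
            missing.append(ref)
    return missing
```

The `registered` set can be built from the GET /files/v1/data-sources listing by collecting each entry's provider and bucket.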
Troubleshooting
data_source_access_denied on submission
The data source is registered but Mathpix can't reach the bucket. Run /test for the specific failure message. Common causes by provider:
AWS
- Trust policy not yet propagated: wait 30–60 seconds and retry.
- ExternalId mismatch: verify the value in your trust policy against /onboarding/identities.
- Mathpix can list the bucket but not read objects → permission policy missing s3:GetObject, or the object-level statement is missing the /* resource suffix.
- Mathpix can read but writes fail → permission policy missing s3:PutObject, or the destination_uri points at a different bucket (each bucket needs its own data source).
Azure
- 403 immediately after granting RBAC → assignment hasn't propagated; wait 1–5 minutes.
- 403 after 10+ minutes → assignment landed on the wrong scope or the wrong principal. Verify both:
  # Confirm the service principal exists in your tenant
  az ad sp show --id 580fe4b7-6dbf-4313-adf4-10ea1fda13c7 --query '{appId:appId, displayName:displayName}'
  # List Mathpix's role assignments
  az role assignment list --assignee 580fe4b7-6dbf-4313-adf4-10ea1fda13c7 --all -o table
  If az role assignment list returns no rows for your storage scope, the assignment didn't land where you intended.
- Mathpix can list the container but can't write → you assigned Storage Blob Data Reader instead of Storage Blob Data Contributor.
GCS
- roles/iam.serviceAccountTokenCreator not yet granted, granted to the wrong principal, or still propagating.
- Your target SA itself doesn't have read/write (objectViewer / objectAdmin) on the bucket.
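Several of the causes above are propagation delays that clear on their own, so a bounded retry around /test is often enough. The Python sketch below shows the idea; `wait_for_access` is a hypothetical helper, `probe` stands for any callable you supply that performs the /test call and returns its parsed JSON, and the default attempt count and delay are illustrative rather than recommended values.

```python
import time
from typing import Callable, Dict


def wait_for_access(probe: Callable[[], Dict], attempts: int = 5,
                    delay_seconds: float = 30.0,
                    sleep: Callable[[float], None] = time.sleep) -> Dict:
    """Call probe() until it reports result == "ok" or attempts run out.

    Returns the last response either way, so the caller can read its
    message when access never came up.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last: Dict = {}
    for attempt in range(attempts):
        last = probe()
        if last.get("result") == "ok":
            return last
        # Pause between attempts, but not after the final one.
        if attempt < attempts - 1:
            sleep(delay_seconds)
    return last
```

If the final response is still `failed`, its `message` usually points at a misconfiguration rather than propagation.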
data_source_not_found on submission
The bucket in your source_uri has no data source registered for your group. Either:
- Register it via POST /files/v1/data-sources, or
- Check the bucket name in the URI matches the bucket of an existing data source (GET /files/v1/data-sources).
409 conflict on registration
A data source for that (provider, bucket) already exists in your group. The response includes the existing data_source_id — use that one, or DELETE the existing one first if you want to re-register with different settings.
GCS registration fails with bad_request (bucket-control verification)
GCS registration reads gs://<bucket>/.mathpix-verify and checks its contents against your verification id. The bad_request message tells you which check failed:
- "not found" → you haven't written the verification object yet. See step 3 of the GCS grant guide.
- "could not read" → the tokenCreator grant on your service account, or that SA's read access to the bucket, isn't in place (or is still propagating; wait 30–60s).
- "does not match" → the object's contents aren't your current verification id. Re-fetch it from /onboarding/identities (the external_id value) and overwrite the object exactly, with no trailing characters.
secret rejected with 400 bad_request for Azure / GCS
Azure (azure_ad) and GCS (service_account) are keyless — they never accept a secret field. Only AWS access_key (legacy fallback) does. Use iam_role for AWS instead.
See also
- POST /files/v1/uri: single-document submission using a registered source_uri.
- POST /files/v1/jobs: bulk submission.
- Files API Quickstart: end-to-end walkthrough.