What is Document Digitisation?

Sarvam's Document Digitisation API extracts text, tables, and structural information from scanned documents, PDFs, and images across 23 languages (22 Indian + English) with state-of-the-art accuracy. It's powered by a purpose-built 3B parameter vision-language model.

What languages does it support?

It supports all 22 scheduled Indian languages: Hindi, Bengali, Tamil, Telugu, Marathi, Gujarati, Kannada, Malayalam, Assamese, Urdu, Sanskrit, Nepali, Dogri, Bodo, Punjabi, Odia, Konkani, Maithili, Sindhi, Kashmiri, Manipuri, and Santali, plus English.

What input and output formats are supported?

The API accepts PDF, PNG, JPG, and ZIP files. Output is delivered in your choice of HTML (rich formatting), Markdown (readable), or JSON (programmatic). For batch jobs, a ZIP output packages all results together.
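A client can check these formats before submitting anything. The following is a minimal sketch: the function name and constants are illustrative and not part of any SDK, and it assumes only the extensions and formats named above (whether the service also accepts variants such as .jpeg is not stated here).

```python
from pathlib import Path

# Formats named on this page. The helper itself is hypothetical, not an SDK call.
INPUT_EXTENSIONS = {".pdf", ".png", ".jpg", ".zip"}
OUTPUT_FORMATS = {"html", "markdown", "json"}


def check_request(filename: str, output_format: str) -> tuple[str, str]:
    """Return (extension, output_format) in lower case, or raise ValueError."""
    ext = Path(filename).suffix.lower()
    fmt = output_format.lower()
    if ext not in INPUT_EXTENSIONS:
        raise ValueError(f"unsupported input type: {ext or '(none)'}")
    if fmt not in OUTPUT_FORMATS:
        raise ValueError(f"unsupported output format: {output_format}")
    return ext, fmt
```

Failing early on the client side avoids creating a job that cannot be processed.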

How does the async job API work?

Create a job with your desired language and output format, upload your document, trigger processing, then poll and download results. This design handles large documents and 10,000+ page batch workflows reliably.
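The create, upload, trigger, poll, and download sequence can be sketched as below. This is not the SDK's actual interface: JobClient, its method names, and the status strings "running", "completed", and "failed" are all assumptions made for illustration, and FakeClient is an in-memory stand-in so the sketch runs without network access.

```python
import time
from typing import Protocol


class JobClient(Protocol):
    """Hypothetical interface; the real SDK's method names may differ."""

    def create_job(self, language: str, output_format: str) -> str: ...
    def upload(self, job_id: str, data: bytes) -> None: ...
    def start(self, job_id: str) -> None: ...
    def status(self, job_id: str) -> str: ...
    def download(self, job_id: str) -> bytes: ...


def run_job(client: JobClient, data: bytes, language: str, output_format: str,
            poll_seconds: float = 2.0, timeout_seconds: float = 600.0) -> bytes:
    """Create a job, upload the document, start it, poll, then download."""
    job_id = client.create_job(language, output_format)
    client.upload(job_id, data)
    client.start(job_id)
    deadline = time.monotonic() + timeout_seconds
    while True:
        state = client.status(job_id)
        if state == "completed":  # assumed status string
            return client.download(job_id)
        if state == "failed":  # assumed status string
            raise RuntimeError(f"job {job_id} failed")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} still {state!r} at timeout")
        time.sleep(poll_seconds)


class FakeClient:
    """In-memory stand-in that reports completion after a few polls."""

    def __init__(self, polls_until_done: int = 2) -> None:
        self._remaining = polls_until_done
        self._data = b""

    def create_job(self, language: str, output_format: str) -> str:
        return "job-1"

    def upload(self, job_id: str, data: bytes) -> None:
        self._data = data

    def start(self, job_id: str) -> None:
        pass

    def status(self, job_id: str) -> str:
        self._remaining -= 1
        return "completed" if self._remaining <= 0 else "running"

    def download(self, job_id: str) -> bytes:
        return b"# digitised\n" + self._data


result = run_job(FakeClient(), b"page bytes", "hi-IN", "markdown", poll_seconds=0.0)
```

Polling against a deadline rather than looping forever keeps a stuck job from blocking a batch pipeline.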

How accurate is table extraction?

Document Digitisation handles merged cells, multi-level headers, and invisible borders with high fidelity. Row/column structure is fully preserved in clean HTML or Markdown tables.
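Downstream code often wants merged cells expanded into a plain rectangular grid. The sketch below does that for an HTML table using only Python's standard html.parser module. The sample markup is invented for illustration and is not actual API output, and the parser assumes a single, well-formed table with explicit closing tags.

```python
from html.parser import HTMLParser


class TableGrid(HTMLParser):
    """Expand table cells into a grid, repeating the text of merged cells."""

    def __init__(self) -> None:
        super().__init__()
        self.grid = []      # list of rows, each a list of cell strings
        self._pending = {}  # (row, col) -> text claimed by an earlier rowspan
        self._row = -1
        self._col = 0
        self._span = (1, 1)
        self._text = None   # text fragments of the cell being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row += 1
            self._col = 0
            self.grid.append([])
        elif tag in ("td", "th"):
            a = dict(attrs)
            self._span = (int(a.get("rowspan") or 1), int(a.get("colspan") or 1))
            self._text = []

    def handle_data(self, data):
        if self._text is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._text is not None:
            value = "".join(self._text).strip()
            self._text = None
            self._fill_pending()
            rows, cols = self._span
            for dc in range(cols):
                self.grid[self._row].append(value)
                # Reserve the same column in the rows this cell spans downwards.
                for dr in range(1, rows):
                    self._pending[(self._row + dr, self._col + dc)] = value
            self._col += cols
        elif tag == "tr":
            self._fill_pending()

    def _fill_pending(self):
        # Insert any cells carried down from a rowspan at the current position.
        while (self._row, self._col) in self._pending:
            self.grid[self._row].append(self._pending.pop((self._row, self._col)))
            self._col += 1


def table_to_grid(html: str) -> list[list[str]]:
    parser = TableGrid()
    parser.feed(html)
    return parser.grid


sample = """
<table>
<tr><th rowspan="2">Name</th><th colspan="2">Score</th></tr>
<tr><th>Math</th><th>Art</th></tr>
<tr><td>Asha</td><td>90</td><td>85</td></tr>
</table>
"""
grid = table_to_grid(sample)
```

Each row of the resulting grid has the same number of columns, which makes it straightforward to load into a spreadsheet or dataframe.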

What is the pricing?

Document Digitisation is priced at ₹1.50 per page. A free trial is available with no credit card required. Volume and enterprise pricing are available for high-throughput use cases.
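As a worked example of the per-page rate, the sketch below estimates the list-price cost of a job. The function name is illustrative, and the estimate ignores the free trial, volume discounts, and any taxes, none of which are quantified here.

```python
from decimal import Decimal

PRICE_PER_PAGE_INR = Decimal("1.50")  # list price quoted on this page


def estimate_cost_inr(pages: int) -> Decimal:
    """List-price estimate in rupees; discounts and taxes are not modelled."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    return PRICE_PER_PAGE_INR * pages
```

For example, a 10,000-page batch comes to 10,000 × ₹1.50 = ₹15,000 at list price.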

Document Digitisation

Understand every Indian document

Extract text, tables, and structure from documents with remarkable precision across 23 languages.

have knowledge of some vacant "Consulate" or "Special Service", that my Record and Endorsements would warrant my filling to the advantage of the Government

A Knowledge of your Selection and appointment of Such only as are most fitting for the place regarded of politics or local influence has prompted me you and myself to look to you Mr President for that just consideration we have failed to secure at other hands,

With the assurance of two having their Countrys welfare more at heart than their own personal interest believe us Mr President

Your Obt Servants

Wm H. Young and native of Erie County New York Wife F Rowland Young native of St Markes Florida

address P.O. box 565 Washington DC


Built for real document workloads

Production-grade Document AI with structured outputs, async processing, and enterprise-ready APIs.

23 language support

All 22 scheduled Indian languages plus English, with native Indic script recognition across every script family.

Complex table parsing

Accurately extracts tables with merged cells, multi-level headers, and invisible borders into clean HTML or Markdown.

Structured output formats

Four output modes: HTML, Markdown, JSON, and ZIP. Ready for downstream pipelines and LLM ingestion.

Any document format

PDF, PNG, JPG, and ZIP archives. Single pages or bulk batches, handled uniformly through the async job API.
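For bulk batches, page images can be bundled into one ZIP archive with Python's standard zipfile module before upload. This is a sketch under an assumption: a flat archive layout (files at the top level, no folders) is a guess about what the service expects, so check the API reference before relying on it.

```python
import tempfile
import zipfile
from pathlib import Path


def bundle_pages(paths: list[Path], archive: Path) -> int:
    """Write the files into a flat ZIP archive and return how many were added."""
    with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for path in paths:
            # arcname drops the directory part; files sharing a name would collide.
            zf.write(path, arcname=path.name)
    return len(paths)


# Demo with throwaway files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    pages = [root / "page1.png", root / "page2.png"]
    for page in pages:
        page.write_bytes(b"placeholder bytes")
    archive = root / "batch.zip"
    count = bundle_pages(pages, archive)
    with zipfile.ZipFile(archive) as zf:
        names = sorted(zf.namelist())
```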

SOTA benchmark accuracy

Leading performance on global document understanding benchmarks. Outperforms general-purpose models on Indian documents.

Built for every document workflow

Document digitisation

Convert scanned documents, PDFs, and legacy archives into structured, searchable digital formats across all Indian languages.

Government records & archives

Academic papers & textbooks

Legal documents & contracts

Historical & cultural manuscripts


State-of-the-art Document Digitisation

Leading performance on global benchmarks.

[Chart: olmOCR overall performance. Score (%), higher is better.]

23 languages, every script natively understood

हिन्दी · Hindi · hi-IN

বাংলা · Bengali · bn-IN

தமிழ் · Tamil · ta-IN

తెలుగు · Telugu · te-IN

मराठी · Marathi · mr-IN

ગુજરાતી · Gujarati · gu-IN

ಕನ್ನಡ · Kannada · kn-IN

മലയാളം · Malayalam · ml-IN

অসমীয়া · Assamese · as-IN

اردو · Urdu · ur-IN

संस्कृतम् · Sanskrit · sa-IN

नेपाली · Nepali · ne-IN

डोगरी · Dogri · doi-IN

बड़ो · Bodo · brx-IN

ਪੰਜਾਬੀ · Punjabi · pa-IN

ଓଡ଼ିଆ · Odia · od-IN

कोंकणी · Konkani · kok-IN

मैथिली · Maithili · mai-IN

سنڌي · Sindhi · sd-IN

कॉशुर · Kashmiri · ks-IN

মৈতৈলোন্ · Manipuri · mni-IN

ᱥᱟᱱᱛᱟᱲᱤ · Santali · sat-IN

English · en-IN
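The codes listed above can be kept in a small lookup table so that application code refers to languages by name. The mapping is copied from this list. The helper function is illustrative and not part of any SDK.

```python
# Language name -> code, exactly as listed on this page.
LANGUAGE_CODES = {
    "Hindi": "hi-IN", "Bengali": "bn-IN", "Tamil": "ta-IN", "Telugu": "te-IN",
    "Marathi": "mr-IN", "Gujarati": "gu-IN", "Kannada": "kn-IN",
    "Malayalam": "ml-IN", "Assamese": "as-IN", "Urdu": "ur-IN",
    "Sanskrit": "sa-IN", "Nepali": "ne-IN", "Dogri": "doi-IN", "Bodo": "brx-IN",
    "Punjabi": "pa-IN", "Odia": "od-IN", "Konkani": "kok-IN",
    "Maithili": "mai-IN", "Sindhi": "sd-IN", "Kashmiri": "ks-IN",
    "Manipuri": "mni-IN", "Santali": "sat-IN", "English": "en-IN",
}


def code_for(language: str) -> str:
    """Look up a language code by English name, ignoring case and padding."""
    key = language.strip().title()
    if key not in LANGUAGE_CODES:
        raise ValueError(f"unsupported language: {language!r}")
    return LANGUAGE_CODES[key]
```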


Developer-first platform

OpenAI-compatible APIs. Drop-in SDKs for Python and Node.js. Go from zero to first extraction in under 5 minutes.

Async job-based API

Upload, process, and download. Designed for large documents and batch workflows with predictable throughput.

SDKs & libraries

Official Python and Node.js SDKs with TypeScript support. Install the Python SDK with pip install sarvamai.

Complete documentation

Interactive API reference, code samples, and integration guides for every endpoint.

Free tier included

Start building immediately. No credit card, no sales call, no minimum commitment.

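# Quick-start example using the Python SDK.
from sarvamai import SarvamAI

# Authenticate with your API subscription key.
client = SarvamAI(api_subscription_key="YOUR_SARVAM_API_KEY")

# Open the document in binary mode and submit it for digitisation.
with open("document.pdf", "rb") as f:
    response = client.document_digitisation.process(
        file=f,
        language="hi-IN",          # language code of the document (Hindi)
        output_format="markdown",  # one of the supported output formats
        model="sarvam-ocr",
    )

# Print the extracted Markdown and the number of pages processed.
print(response.markdown)
print(f"Pages processed: {response.page_count}")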

Enterprise-ready. Responsible AI.

Built with safety, compliance, and data sovereignty at the core.

SOC 2 Type II & ISO 27001

Enterprise-grade security certifications. Annual audits, documented controls, continuous monitoring.

Data sovereignty

All data processed and stored in India. No cross-border transfers. Full compliance with Indian data regulations.

No training on your data

Your API inputs are never used for model training. Zero data retention after processing unless explicitly requested.

On-premise deployment

Deploy within your own infrastructure for maximum control. Air-gapped environments supported for sensitive document workflows.

Redaction-ready output

Structured output with positional metadata enables downstream PII masking and redaction in compliance pipelines.
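The sketch below shows how positional metadata could drive a masking step. The block schema used here (a "text" string and a "bbox" list of x0, y0, x1, y1) is a hypothetical shape chosen for illustration and is not the documented JSON output. The pattern for a twelve-digit identifier is likewise only an example.

```python
import re

# Twelve digits, optionally grouped in fours, e.g. "1234 5678 9012".
# An illustrative pattern only; real PII detection needs far more than this.
ID_PATTERN = re.compile(r"\b\d{4}[ -]?\d{4}[ -]?\d{4}\b")


def boxes_to_redact(blocks: list[dict]) -> list[tuple]:
    """Return the bounding boxes of blocks whose text matches the pattern."""
    return [tuple(b["bbox"]) for b in blocks if ID_PATTERN.search(b["text"])]


# Hypothetical extracted blocks; the field names are assumptions.
blocks = [
    {"text": "Name: Asha Rao", "bbox": [10, 10, 200, 30]},
    {"text": "ID: 1234 5678 9012", "bbox": [10, 40, 220, 60]},
]
boxes = boxes_to_redact(blocks)
```

The returned boxes can then be passed to whatever tool blacks out those regions on the page image.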

Audit-ready logging

Comprehensive API usage logs, access controls, and RBAC for enterprise governance and compliance reporting.

Simple, transparent pricing

Start free. Scale as you grow. No hidden costs.

Base plan

₹1.50 per page

Free trial included

No credit card required. Get API keys instantly.

PDF, PNG, JPG & ZIP support

HTML & Markdown output

Volume discounts available

Enterprise pricing available

23 languages included

Async job-based processing
