Sarvam AI

Document Digitisation

Understand every Indian document

Extract text, tables, and structure from documents with remarkable precision across 23 languages.

have knowledge of some vacant "Consulate" or "Special Service", that my Record and Endorsements would warrant my filling to the advantage of the Government

A Knowledge of your Selection and appointment of Such only as are most fitting for the place regarded of politics or local influence has prompted me you and myself to look to you Mr President for that just consideration we have failed to secure at other hands,

With the assurance of two having their Countrys welfare more at heart than their own personal interest believe us Mr President

Your Obt Servants

Wm H. Young and native of Erie County New York Wife F Rowland Young native of St Markes Florida

address P.O. box 565 Washington DC


Built for real document workloads

Production-grade Document AI with structured outputs, async processing, and enterprise-ready APIs.

23 language support

All 22 scheduled Indian languages plus English, with native Indic script recognition across every script family.

Complex table parsing

Accurately extracts tables with merged cells, multi-level headers, and invisible borders into clean HTML or Markdown.
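Downstream code often needs a merged-cell table as a plain rectangular grid. Below is a minimal sketch using only Python's standard library. It assumes the HTML output marks merged cells with standard colspan and rowspan attributes; the sample table and the names TableGridParser and table_to_grid are made up for illustration and are not actual API output.

```python
from html.parser import HTMLParser


class TableGridParser(HTMLParser):
    """Collect table cells as (text, colspan, rowspan) tuples, row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._cell = None  # [text_parts, colspan, rowspan] while inside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            a = dict(attrs)
            self._cell = [[], int(a.get("colspan") or 1), int(a.get("rowspan") or 1)]

    def handle_data(self, data):
        if self._cell is not None:
            self._cell[0].append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            parts, colspan, rowspan = self._cell
            self._row.append(("".join(parts).strip(), colspan, rowspan))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_to_grid(html):
    """Expand merged cells so every (row, column) slot holds a value."""
    parser = TableGridParser()
    parser.feed(html)
    grid = {}  # (row, col) -> cell text
    for r, row in enumerate(parser.rows):
        c = 0
        for text, colspan, rowspan in row:
            while (r, c) in grid:  # slot already filled by a rowspan from above
                c += 1
            for dr in range(rowspan):
                for dc in range(colspan):
                    grid[(r + dr, c + dc)] = text
            c += colspan
    if not grid:
        return []
    n_rows = max(r for r, _ in grid) + 1
    n_cols = max(c for _, c in grid) + 1
    return [[grid.get((r, c), "") for c in range(n_cols)] for r in range(n_rows)]


# Made-up sample with a two-level header, for illustration only.
SAMPLE_HTML = """
<table>
  <tr><th rowspan="2">Name</th><th colspan="2">Marks</th></tr>
  <tr><th>Maths</th><th>Art</th></tr>
  <tr><td>Asha</td><td>90</td><td>85</td></tr>
</table>
"""

grid = table_to_grid(SAMPLE_HTML)
for row in grid:
    print(row)
```

Repeating the merged value in every slot it covers keeps each row the same width, which is usually what CSV export or DataFrame loading expects.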

Structured output formats

Four output modes: HTML, Markdown, JSON, and ZIP. Ready for downstream pipelines and LLM ingestion.

Any document format

PDF, PNG, JPG, and ZIP archives. Single pages or bulk batches, handled uniformly through the async job API.
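For bulk batches, one plausible way to prepare input is to bundle the supported files into a single ZIP archive before upload. The sketch below uses only the standard library. The flat archive layout, the helper name bundle_documents, and the exact extension set are assumptions for illustration, so check the API reference for the expected archive structure.

```python
import io
import zipfile
from pathlib import Path

# Extensions taken from the formats listed above (assumed case-insensitive).
SUPPORTED_EXTENSIONS = {".pdf", ".png", ".jpg"}


def bundle_documents(paths):
    """Return (zip_bytes, skipped_names) for the given file paths.

    Files with unsupported extensions are skipped rather than archived.
    Files are stored flat, under their base name only.
    """
    buffer = io.BytesIO()
    skipped = []
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in map(Path, paths):
            if path.suffix.lower() in SUPPORTED_EXTENSIONS:
                archive.write(path, arcname=path.name)
            else:
                skipped.append(path.name)
    return buffer.getvalue(), skipped
```

Because entries are stored by base name, two inputs with the same file name in different folders would collide; rename them or keep relative paths if that matters for your batch.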

SOTA benchmark accuracy

Leading performance on global document understanding benchmarks. Outperforms general-purpose models on Indian documents.

Built for every document workflow

Document digitisation

Convert scanned documents, PDFs, and legacy archives into structured, searchable digital formats across all Indian languages.

Government records & archives

Academic papers & textbooks

Legal documents & contracts

Historical & cultural manuscripts


State-of-the-art Document Digitisation

Leading performance on global benchmarks.

[Figure: olmOCR benchmark, overall performance. Score (%), higher is better.]

23 languages, every script natively understood

हिन्दी (Hindi) · hi-IN
বাংলা (Bengali) · bn-IN
தமிழ் (Tamil) · ta-IN
తెలుగు (Telugu) · te-IN
मराठी (Marathi) · mr-IN
ગુજરાતી (Gujarati) · gu-IN
ಕನ್ನಡ (Kannada) · kn-IN
മലയാളം (Malayalam) · ml-IN
অসমীয়া (Assamese) · as-IN
اردو (Urdu) · ur-IN
संस्कृतम् (Sanskrit) · sa-IN
नेपाली (Nepali) · ne-IN
डोगरी (Dogri) · doi-IN
बड़ो (Bodo) · brx-IN
ਪੰਜਾਬੀ (Punjabi) · pa-IN
ଓଡ଼ିଆ (Odia) · od-IN
कोंकणी (Konkani) · kok-IN
मैथिली (Maithili) · mai-IN
سنڌي (Sindhi) · sd-IN
कॉशुर (Kashmiri) · ks-IN
মৈতৈলোন্ (Manipuri) · mni-IN
ᱥᱟᱱᱛᱟᱲᱤ (Santali) · sat-IN
English · en-IN
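
Client code may want to validate a language code before submitting a document. The sketch below builds a lookup from the codes listed above; the helper name normalize_language_code and the case-normalisation behaviour are assumptions for illustration, not part of the SDK.

```python
# Codes and names copied from the language list above.
SUPPORTED_LANGUAGES = {
    "hi-IN": "Hindi", "bn-IN": "Bengali", "ta-IN": "Tamil", "te-IN": "Telugu",
    "mr-IN": "Marathi", "gu-IN": "Gujarati", "kn-IN": "Kannada",
    "ml-IN": "Malayalam", "as-IN": "Assamese", "ur-IN": "Urdu",
    "sa-IN": "Sanskrit", "ne-IN": "Nepali", "doi-IN": "Dogri",
    "brx-IN": "Bodo", "pa-IN": "Punjabi", "od-IN": "Odia",
    "kok-IN": "Konkani", "mai-IN": "Maithili", "sd-IN": "Sindhi",
    "ks-IN": "Kashmiri", "mni-IN": "Manipuri", "sat-IN": "Santali",
    "en-IN": "English",
}


def normalize_language_code(code):
    """Return the canonical form (e.g. 'hi-IN') or raise ValueError."""
    lang, _, region = code.strip().partition("-")
    normalized = f"{lang.lower()}-{region.upper()}"
    if normalized not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unsupported language code: {code!r}")
    return normalized


print(normalize_language_code("ta-in"), SUPPORTED_LANGUAGES["ta-IN"])
```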

Developer-first platform

OpenAI-compatible APIs. Drop-in SDKs for Python and Node.js. Go from zero to first extraction in under 5 minutes.

Async job-based API

Upload, process, and download. Designed for large documents and batch workflows with predictable throughput.
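
The upload, process, download flow usually means the client polls for job completion between the upload and the download. The sketch below shows one generic way to do that with exponential backoff. The status strings ("pending", "running", "completed", "failed") and the get_status callable are hypothetical stand-ins, not the actual API; swap in whatever the job endpoint really returns.

```python
import time


def wait_for_job(get_status, poll_interval=2.0, max_interval=30.0,
                 timeout=600.0, sleep=time.sleep, clock=time.monotonic):
    """Poll get_status() until the job finishes, fails, or times out.

    get_status is any zero-argument callable returning a status string.
    The delay doubles after each poll, capped at max_interval seconds.
    """
    deadline = clock() + timeout
    delay = poll_interval
    while True:
        status = get_status()
        if status == "completed":
            return status
        if status == "failed":
            raise RuntimeError("document job failed")
        if clock() + delay > deadline:
            raise TimeoutError("document job did not finish before the timeout")
        sleep(delay)
        delay = min(delay * 2, max_interval)


# Demo with a fake status source instead of a real API call;
# the sleep function just records the requested delays.
fake_statuses = iter(["pending", "running", "completed"])
waits = []
result = wait_for_job(lambda: next(fake_statuses), sleep=waits.append)
print(result, waits)
```

Passing sleep and clock as parameters keeps the loop easy to test without real waiting.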

SDKs & libraries

Official Python and Node.js SDKs with TypeScript support. Install the Python SDK with pip install sarvamai.

Complete documentation

Interactive API reference, code samples, and integration guides for every endpoint.

Free tier included

Start building immediately. No credit card, no sales call, no minimum commitment.

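from sarvamai import SarvamAI

# Create a client authenticated with your API subscription key.
client = SarvamAI(api_subscription_key="YOUR_SARVAM_API_KEY")

# Open the document in binary mode and submit it for digitisation.
with open("document.pdf", "rb") as f:
    response = client.document_digitisation.process(
        file=f,
        language="hi-IN",          # language code of the document (Hindi)
        output_format="markdown",  # requested output format
        model="sarvam-ocr"
    )

# Print the extracted Markdown and the number of pages processed.
print(response.markdown)
print(f"Pages processed: {response.page_count}")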

Enterprise-ready. Responsible AI.

Built with safety, compliance, and data sovereignty at the core.

SOC 2 Type II & ISO 27001

Enterprise-grade security certifications. Annual audits, documented controls, continuous monitoring.

Data sovereignty

All data processed and stored in India. No cross-border transfers. Full compliance with Indian data regulations.

No training on your data

Your API inputs are never used for model training. Zero data retention after processing unless explicitly requested.

On-premise deployment

Deploy within your own infrastructure for maximum control. Air-gapped environments supported for sensitive document workflows.

Redaction-ready output

Structured output with positional metadata enables downstream PII masking and redaction in compliance pipelines.
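
As a rough illustration of how positional metadata can feed a redaction step, the sketch below scans text blocks for PII-like patterns, masks the matches, and collects the bounding boxes to black out on the page image. The block schema (page, bbox, text), the two regular expressions, and the function name redact_blocks are all assumptions made for this example; the real output schema may differ, and production PII detection needs far more than two patterns.

```python
import re

# Illustrative patterns only: a 12-digit ID in groups of four,
# and a 10-digit mobile number starting with 6 to 9.
PII_PATTERNS = {
    "id_like": re.compile(r"\b\d{4}\s?\d{4}\s?\d{4}\b"),
    "mobile_like": re.compile(r"\b[6-9]\d{9}\b"),
}


def redact_blocks(blocks):
    """Mask PII-like text and return (redacted_blocks, boxes_to_redact).

    Each block is a dict with hypothetical keys: page, bbox, text.
    boxes_to_redact lists (page, bbox) for every block that had a match.
    """
    redacted, boxes = [], []
    for block in blocks:
        text = block["text"]
        hit = False
        for pattern in PII_PATTERNS.values():
            text, count = pattern.subn(lambda m: "X" * len(m.group()), text)
            hit = hit or count > 0
        redacted.append({**block, "text": text})
        if hit:
            boxes.append((block["page"], block["bbox"]))
    return redacted, boxes


# Made-up blocks for illustration.
sample_blocks = [
    {"page": 1, "bbox": [10, 20, 200, 40], "text": "Name: A. Kumar"},
    {"page": 1, "bbox": [10, 50, 200, 70], "text": "ID: 1234 5678 9012"},
    {"page": 2, "bbox": [10, 20, 200, 40], "text": "Phone 9876543210"},
]

redacted, boxes = redact_blocks(sample_blocks)
for block in redacted:
    print(block["page"], block["text"])
```

Masking whole blocks by bounding box is coarse; word-level boxes, if the output provides them, would allow tighter redaction.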

Audit-ready logging

Comprehensive API usage logs, access controls, and RBAC for enterprise governance and compliance reporting.

Simple, transparent pricing

Start free. Scale as you grow. No hidden costs.

Base plan

₹1.5 per page

Free trial included

No credit card required. Get API keys instantly.

PDF, PNG, JPG & ZIP support
HTML & Markdown output
Volume discounts available
Enterprise pricing available
23 languages included
Async job-based processing
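
For budgeting, the base rate makes a cost estimate a single multiplication. The sketch below assumes the flat ₹1.5 per page base plan only; volume discounts and enterprise pricing are not public on this page, so they are not modelled. The function name estimate_cost_inr is made up for this example.

```python
from decimal import Decimal

BASE_RATE_INR = Decimal("1.5")  # rupees per page, from the base plan


def estimate_cost_inr(page_count, rate=BASE_RATE_INR):
    """Estimate the cost in rupees at a flat per-page rate (no discounts)."""
    if page_count < 0:
        raise ValueError("page_count must be non-negative")
    return rate * page_count


# Example: a 1,200-page archive at the base rate.
print(estimate_cost_inr(1200))
```

Decimal avoids the rounding surprises that binary floats can introduce when amounts are later summed across many jobs.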

Start extracting in minutes. Go live today.