Document Digitisation
Understand every Indian document
Extract text, tables, and structure from documents with remarkable precision across 23 languages.

Built for real document workloads
Production-grade Document AI with structured outputs, async processing, and enterprise-ready APIs.

23 language support
All 22 scheduled Indian languages plus English, with native Indic script recognition across every script family.

Complex table parsing
Accurately extracts tables with merged cells, multi-level headers, and invisible borders into clean HTML or Markdown.

Structured output formats
Four output modes: HTML, Markdown, JSON, and ZIP. Ready for downstream pipelines and LLM ingestion.

Any document format
PDF, PNG, JPG, and ZIP archives. Single pages or bulk batches, handled uniformly through the async job API.

SOTA benchmark accuracy
Leading performance on global document understanding benchmarks. Outperforms general-purpose models on Indian documents.
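As an illustration of feeding Markdown output into a downstream pipeline, the sketch below turns a simple pipe-style Markdown table into a list of row dictionaries. It is a minimal example under stated assumptions: `parse_markdown_table` is a hypothetical helper (not part of any SDK), the input is assumed to contain a single table with one header row, and escaped pipes inside cells are not handled. Tables with merged cells cannot be expressed in plain Markdown, so those would need the HTML output instead.

```python
def parse_markdown_table(markdown: str) -> list[dict[str, str]]:
    """Parse a single pipe-style Markdown table into a list of row dicts.

    Hypothetical helper for illustration only. Assumes one table per input
    string, a header row followed by a separator row, and no escaped pipes.
    """

    def split_row(line: str) -> list[str]:
        # Drop the outer pipes, then split on the inner ones.
        return [cell.strip() for cell in line.strip().strip("|").split("|")]

    # Keep only the lines that look like table rows.
    lines = [ln for ln in markdown.splitlines() if ln.strip().startswith("|")]
    if len(lines) < 2:
        return []

    header = split_row(lines[0])
    rows = []
    for line in lines[2:]:  # lines[1] is the |---|---| separator row
        rows.append(dict(zip(header, split_row(line))))
    return rows


sample = "| Name | Amount |\n|---|---|\n| Asha | 1200 |\n| Ravi | 450 |"
for row in parse_markdown_table(sample):
    print(row)
```

Each row keeps cell values as strings, so numeric conversion is left to the consuming pipeline.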
Built for every document workflow
Document digitisation
Convert scanned documents, PDFs, and legacy archives into structured, searchable digital formats across all Indian languages.
Government records & archives
Academic papers & textbooks
Legal documents & contracts
Historical & cultural manuscripts

State-of-the-art Document Digitisation
Leading performance on global benchmarks.
Figure: olmOCR overall performance, score (%), higher is better.
23 languages, every script natively understood
Developer-first platform
OpenAI-compatible APIs. Drop-in SDKs for Python and Node.js. Go from zero to first extraction in under 5 minutes.
Async job-based API
Upload, process, and download. Designed for large documents and batch workflows with predictable throughput.
SDKs & libraries
Official Python and Node.js SDKs with TypeScript support. Install the Python SDK with pip install sarvam-ai.
Complete documentation
Interactive API reference, code samples, and integration guides for every endpoint.
Free tier included
Start building immediately. No credit card, no sales call, no minimum commitment.
from sarvamai import SarvamAI

client = SarvamAI(api_subscription_key="YOUR_SARVAM_API_KEY")

# Send a PDF for digitisation and request Markdown output.
with open("document.pdf", "rb") as f:
    response = client.document_digitisation.process(
        file=f,
        language="hi-IN",
        output_format="markdown",
        model="sarvam-ocr",
    )

print(response.markdown)
print(f"Pages processed: {response.page_count}")
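An upload, process, and download flow generally means the client has to wait for a job to finish before fetching results. The sketch below shows one common way to do that with a polling loop and a timeout. It does not call the Sarvam SDK: `wait_for_job`, the `get_status` callable, and the status strings `"completed"` and `"failed"` are all assumptions made for illustration, so the real status values and endpoints should be taken from the API reference.

```python
import time
from typing import Callable


def wait_for_job(
    get_status: Callable[[], str],
    *,
    interval: float = 2.0,
    timeout: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll a job until it reaches a terminal state or the timeout expires.

    `get_status` is any zero-argument callable returning the current job
    status as a string. The terminal status names are hypothetical.
    """
    deadline = time.monotonic() + timeout
    while True:
        status = get_status()
        if status in ("completed", "failed"):
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still '{status}' after {timeout} seconds")
        sleep(interval)


# Usage with a fake job that finishes on the third status check.
fake_states = iter(["pending", "running", "completed"])
final = wait_for_job(lambda: next(fake_states), interval=0.0)
print(final)
```

Passing `sleep` as a parameter keeps the helper easy to test, and a fixed interval could be swapped for exponential backoff on long-running batches.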
Enterprise-ready. Responsible AI.
Built with safety, compliance, and data sovereignty at the core.
SOC 2 Type II & ISO 27001
Enterprise-grade security certifications. Annual audits, documented controls, continuous monitoring.
Data sovereignty
All data processed and stored in India. No cross-border transfers. Full compliance with Indian data regulations.
No training on your data
Your API inputs are never used for model training. Zero data retention after processing unless explicitly requested.
On-premise deployment
Deploy within your own infrastructure for maximum control. Air-gapped environments supported for sensitive document workflows.
Redaction-ready output
Structured output with positional metadata enables downstream PII masking and redaction in compliance pipelines.
Audit-ready logging
Comprehensive API usage logs, access controls, and RBAC for enterprise governance and compliance reporting.
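To make the redaction-ready output concrete, here is a minimal sketch of a downstream masking step. It assumes a hypothetical JSON shape in which each extracted block carries a `text` string and a `bbox` of `[x0, y0, x1, y1]`; the actual field names in the JSON output may differ. The 10-digit pattern is only a stand-in for a real PII detector.

```python
import re

# Illustrative pattern only: any standalone 10-digit number (e.g. a phone number).
PHONE_RE = re.compile(r"\b\d{10}\b")


def find_redaction_boxes(blocks, pattern=PHONE_RE):
    """Return the bounding boxes of blocks whose text matches the pattern.

    `blocks` is assumed to be a list of dicts with "text" and "bbox" keys;
    this shape is hypothetical, not a documented output schema.
    """
    return [block["bbox"] for block in blocks if pattern.search(block["text"])]


def mask_text(text, pattern=PHONE_RE):
    """Replace every match with a run of X characters of the same length."""
    return pattern.sub(lambda m: "X" * len(m.group()), text)


blocks = [
    {"text": "Contact: 9876543210", "bbox": [10, 20, 200, 40]},
    {"text": "Invoice total", "bbox": [10, 50, 200, 70]},
]
print(find_redaction_boxes(blocks))
print(mask_text(blocks[0]["text"]))
```

The returned boxes can be drawn over the page image to redact the scan, while `mask_text` covers the extracted text itself.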
Base plan
Free trial included
No credit card required. Get API keys instantly.
Start extracting in minutes. Go live today.