Large language models have demonstrated remarkable capabilities across diverse tasks, yet their development has predominantly focused on English and other high-resource languages. This English-centric approach has created a significant technological gap for the more than a billion speakers of Indian languages. While there have been efforts to add Indic languages to popular LLMs through continued pretraining, ground-up multilingual efforts like BLOOM remain rare. The effectiveness of these models is further limited by poor token efficiency for Indic scripts and a shortage of high-quality training data in these languages.
We introduce Sarvam-1, a 2-billion parameter language model specifically optimized for Indian languages. Built from the ground up to support 10 major Indian languages alongside English, Sarvam-1 demonstrates that careful curation of training data can yield superior performance even with a relatively modest parameter count. Our work addresses two critical challenges in Indic language modeling:
- Token Efficiency: Existing multilingual models exhibit high token fertility (tokens needed per word) for Indic scripts, often requiring 4 to 8 tokens per word compared to 1.4 for English. Sarvam-1's tokenizer achieves significantly better efficiency, with fertility rates of 1.4-2.1 across all supported languages.
- Data Quality: While web-crawled Indic language data exists, it often lacks depth and quality. Through advanced synthetic-data-generation techniques, we have developed a high-quality training corpus of 2 trillion tokens, specifically for 10 Indic languages (Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, and Telugu).
Despite its compact size, Sarvam-1 demonstrates exceptional performance across standard benchmarks. It achieves high accuracy on both knowledge and reasoning tasks, especially in Indic languages, delivering state-of-the-art performance in its class. It also punches above its weight, remaining competitive with much larger models on most tasks. Concretely, it comfortably outperforms Gemma-2-2B and Llama-3.2-3B on a range of standard benchmarks, including MMLU, ARC-Challenge, and IndicGenBench, while achieving numbers similar to Llama 3.1 8B.
These results are particularly notable given Sarvam-1's size, which enables 4-6x faster inference compared to larger models while matching or exceeding their performance on Indic language tasks. This combination of high performance and computational efficiency makes Sarvam-1 particularly well-suited for practical applications, including deployment on edge devices. The model can be downloaded from 🤗 Hub.
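As a quick-start reference, here is a minimal sketch of loading the model with the Hugging Face transformers library; the repository id used below is an assumption, so check the 🤗 Hub page for the official one.

```
# Minimal sketch: loading Sarvam-1 from the Hugging Face Hub.
# The repository id below is an assumption; check the 🤗 Hub for the official one.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "sarvamai/sarvam-1"  # assumed repository id

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id, torch_dtype=torch.bfloat16)

prompt = "भारत की राजधानी"  # "The capital of India" (Hindi)
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```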
Sarvam-2T: our Indic pretraining corpus
A key challenge in developing effective language models for Indian languages has been the scarcity of high-quality training data. While datasets like Sangraha exist, they often lack the depth, diversity, and quality necessary for training world-class models. Therefore, the bulk of our efforts has been focused on developing high-quality, diverse data that addresses these limitations.
Our training corpus, which we call Sarvam-2T, comprises ~2 trillion Indic tokens in total. The data is split almost evenly among the 10 supported languages, with the exception of Hindi, which makes up about 20% of the corpus. For training Sarvam-1, we augmented Sarvam-2T with an approximately equal amount of English tokens and a substantial collection of code covering most major programming languages. This balanced distribution ensures robust performance across both monolingual and multilingual tasks while maintaining decent coding capabilities.
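For illustration, the per-language token budget implied by this description works out roughly as follows (back-of-the-envelope figures only; exact per-language counts are not published in this post).

```
# Back-of-the-envelope split implied by the description above (illustrative only;
# exact per-language token counts are not published in this post).
total_indic_tokens = 2e12      # ~2 trillion Indic tokens in Sarvam-2T
hindi_share = 0.20             # Hindi is about 20% of the Indic data
num_other_languages = 9        # the remaining supported Indic languages

hindi_tokens = hindi_share * total_indic_tokens
per_other_language = (1 - hindi_share) * total_indic_tokens / num_other_languages
print(f"Hindi: ~{hindi_tokens / 1e9:.0f}B tokens")               # ~400B
print(f"Each other language: ~{per_other_language / 1e9:.0f}B")  # ~178B
```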
Data Quality
Sarvam-2T demonstrates substantial improvements over existing Indic language datasets across multiple key metrics. Here is a comparison with Sangraha, the best open-source Indic pretraining corpus, which mostly contains documents crawled from the web:
Document Quality:
- Average document length is 2x that of web-crawled data
- Quality assessment metrics show 3x more high-quality samples
- Significantly lower repetition rates and improved coherence scores
Content Distribution:
- 8x higher concentration of scientific and technical content
- 6x more programming and technical documentation
- Balanced representation across domains including academic, technical, and general knowledge
- Reduced coverage (0.5x) of potentially sensitive topics
This improved content distribution, particularly the increased representation of scientific and technical material, enhances the model's capabilities in tasks requiring complex reasoning and domain-specific knowledge. The longer documents support better modeling of context and discourse structure, while the higher quality metrics ensure reliable training signals (a simple example of such quality heuristics is sketched below). Deliberately limiting sensitive content reflects our aim of building a comprehensive yet responsible training corpus for Indic language AI development.
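The exact filtering pipeline behind these numbers is not described in this post; the sketch below shows the kind of simple heuristics (minimum length and n-gram repetition rate) commonly used for document-quality filtering, purely as an illustration.

```
# Hypothetical sketch of simple document-quality heuristics (length and n-gram
# repetition rate); this is NOT the exact pipeline used to build Sarvam-2T.
from collections import Counter

def ngram_repetition_rate(text, n=3):
    """Fraction of word n-grams that are duplicates; higher means more repetitive."""
    words = text.split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    repeated = sum(c - 1 for c in counts.values() if c > 1)
    return repeated / len(ngrams)

def passes_quality_filter(text, min_words=200, max_repetition=0.2):
    """Keep documents that are long enough and not overly repetitive."""
    words = text.split()
    return len(words) >= min_words and ngram_repetition_rate(text) <= max_repetition
```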
Examples
A few example snippets from Sarvam-2T are shown below:
Basic algebra in Hindi: यह सामग्री मानती है कि आप कुछ बुनियादी बीजगणित जानते हैं और बहुपदों के साथ संचालन कैसे करते हैं। विभिन्न असंबंधित विषयों पर चर्चा करने के बाद, अब हम बहुपदों पर ध्यान केंद्रित कर रहे हैं जिसमें कई चर होते हैं, जैसे कि $P(x, y, z)$ जब तीन चर ($n = 3$) होते हैं। इन बहुपदों को "समतुल्य" कहा जाता है यदि वे तब भी समान रहते हैं जब हम किसी भी दो चरों को बदलते हैं। उदाहरण के लिए, $P(x, y) = xy + 3$ सममित है क्योंकि $P(x, y) = P(y, x)$। दूसरी ओर, $xy^2 - 4x^2y$ सममित नहीं है।
Astronomy in Oriya: ଆମ ମୀଲକି ୱେ ଗ୍ୟାଲେକ୍ସର ଧାରରେ ଏକ ସ୍ୱତନ୍ତ୍ର ତାର ରହିଛି ଯାହା ତାରଗୁଡ଼ିକ କିପରି ତିଆରି ହୁଅନ୍ତି ସେ ବିଷୟରେ ଆମର ବୁଝାମଣାକୁ ଆହ୍ଵାନ କରେ । ଏହି ତାରା, ଯାହାକୁ ଏସ.ଡି.ଏସ.ଏସ. ଜେ୧୦୨୯୧୫+୧୭୨୯୨୭ କୁହାଯାଏ, ଆମ ବର୍ତ୍ତମାନର ସିଦ୍ଧାନ୍ତରେ ଠିଆ ହେବା ପରି ମନେ ହେଉନାହିଁ । ଏହା ଅତ୍ୟନ୍ତ ଛୋଟ ଏବଂ ପ୍ରାୟ ୧୩ ଶହ କୋଟି ବର୍ଷ ପୁରୁଣା - ଏବଂ ଏହାର ଆକାର ଥିବା ତାରା ସୃଷ୍ଟି କରିବା ପାଇଁ ଏଥିରେ ଯଥେଷ୍ଟ ପରିମାଣର ପଦାର୍ଥ ନାହିଁ।
Web design in Gujarati: એડિટર વિ. બ્રાઉઝર ડિસ્પ્લેને સમજવું
વેબ પેજ ડિઝાઇન કરતી વખતે, તમે મૂળભૂત લેઆઉટથી શરૂઆત કરો છો જે બીજી બધી વસ્તુઓ માટે મંચ સુયોજિત કરે છે. અહીં તે કેવું દેખાય છે:
```
<html>
  <head>
    <meta charset="UTF-8" />
    <title>Page Title Goes Here</title>
  </head>
  <body>
    Page Content Goes Here
  </body>
</html>
```
આ દરેક વેબ પેજ માટે મૂળભૂત ટેમ્પલેટ છે. `<title>` ટેગમાં તમારા પૃષ્ઠનું શીર્ષક હોવું જરૂરી છે. જ્યારે કોઈ વ્યક્તિ તમારું પેજ ખોલે છે, ત્યારે તેઓ તેમના બ્રાઉઝરની ટોચની પટ્ટીમાં આ શીર્ષક જોશે. તમે જે કોઈ પણ વાસ્તવિક સામગ્રી ઇચ્છો છો તેને લોકોને `<body>` ટેગની અંદર જવું જોઈએ. `<body>` ટેગની અંદરનું બધું બ્રાઉઝર વિન્ડોમાં દેખાય છે. તમે `<body>` ટેગની અંદર ટેક્સ્ટ અને ઈમેજ મૂકો છો, અને પછી તેમને યોગ્ય રીતે ગોઠવવા માટે HTML ટેગમાં લપેટી.
Model
Tokenizer
We developed a custom tokenizer optimized specifically for Indic languages, featuring a vocabulary of 68,096 tokens, of which 4,096 are reserved for future expansion. A distinguishing characteristic of our tokenizer is its remarkably low fertility across all supported languages, i.e., the average number of tokens required to encode each word of text.
The efficiency of our tokenizer plays a crucial role in maximizing the effective training signal from the Sarvam-2T corpus. While the raw token count stands at 2 trillion, the actual information density is substantially higher because of the tokenizer's low fertility: each token encapsulates more semantic information than tokens produced by conventional multilingual tokenizers. This has significant implications: when normalized for information content per token, we estimate that our 2 trillion tokens provide a training signal equivalent to 6-8 trillion tokens processed through other popular tokenizers.
As shown in the comparison chart below, our tokenizer achieves significantly lower fertility scores across Indic languages, directly translating to more efficient training and inference processes.
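For illustration, fertility can be measured as the average number of tokens per whitespace-delimited word over a sample of text, as in the sketch below (the tokenizer id is an assumption). At the reported fertility figures, each Indic word costs roughly 3-4x fewer tokens than with typical multilingual tokenizers, which is the basis of the 6-8 trillion-token-equivalent estimate above.

```
# Minimal sketch: measuring tokenizer fertility (average tokens per word).
# The repository id is an assumption; substitute the tokenizer you want to measure.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sarvamai/sarvam-1")  # assumed id

def fertility(texts, tokenizer):
    """Average number of tokens per whitespace-delimited word across a text sample."""
    total_tokens = sum(len(tokenizer.tokenize(t)) for t in texts)
    total_words = sum(len(t.split()) for t in texts)
    return total_tokens / total_words

sample = ["मुझे हिंदी में पढ़ना पसंद है।", "The quick brown fox jumps over the lazy dog."]
print(f"Fertility: {fertility(sample, tokenizer):.2f}")
```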
Architecture
Our model architecture follows established best practices, with a few exceptions. Notably, we opted for a deeper and thinner configuration than similar-sized models, a design choice supported by recent research showing that deeper, narrower models perform better at this scale.
Some key hyperparameters include:
- Hidden size: 2048
- Intermediate size: 11,008
- Number of attention heads: 16
- Number of hidden layers: 28
- Number of key-value heads: 8
- Maximum position embeddings: 8,192
The model uses SwiGLU as its hidden activation function and rotary positional embeddings (RoPE) with a theta of 10,000. It employs grouped-query attention (16 query heads sharing 8 key-value heads) for more efficient inference, and was trained in bfloat16 mixed precision.
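For concreteness, these hyperparameters can be written down as a Llama-style configuration in transformers; the sketch below assumes a Llama-compatible architecture and is illustrative rather than the exact training configuration.

```
# Illustrative Llama-style configuration mirroring the reported hyperparameters;
# this assumes a Llama-compatible architecture, not the exact training config.
from transformers import LlamaConfig

config = LlamaConfig(
    vocab_size=68_096,             # custom Indic tokenizer vocabulary
    hidden_size=2048,
    intermediate_size=11_008,
    num_hidden_layers=28,          # deeper ...
    num_attention_heads=16,        # ... and thinner than typical models of this size
    num_key_value_heads=8,         # grouped-query attention
    max_position_embeddings=8192,
    hidden_act="silu",             # SwiGLU gating uses the SiLU activation
    rope_theta=10_000.0,
    torch_dtype="bfloat16",
)
```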
Training Infrastructure
The model was trained on Yotta's Shakti cluster, utilizing 1,024 GPUs over a 5-day period. We leveraged NVIDIA's NeMo framework for the training process, benefiting from its kernel fusion and other optimizations for large-scale language model training.
Evaluation
The evaluation of large language models for Indic languages presents unique challenges due to the scarcity of standardized benchmarks. To address this, we have structured our evaluation into two components: (1) performance on existing benchmarks adapted for Indic languages, and (2) downstream evaluation on Indic-relevant tasks. We compare Sarvam-1 against Gemma 2 2B, Llama 3.2 3B, and Llama 3.1 8B, noting that despite the larger size of Llama 3.1 8B, Sarvam-1 demonstrates competitive performance.
Academic Benchmarks
Evaluations translated from English
We translated four widely used benchmarks into 10 Indic languages to create a comprehensive evaluation suite:
- MMLU (Massive Multitask Language Understanding): A diverse set of multiple-choice questions spanning various domains, considered a key benchmark for assessing an LLM's broad knowledge.
- ARC-Challenge (AI2 Reasoning Challenge): A grade-school level question-answering dataset designed to evaluate the reasoning capabilities of LLMs.
- BoolQ: A binary (yes/no) question-answering dataset that tests both world knowledge and basic reasoning skills.
- TriviaQA: Originally a generation task assessing factual retrieval, adapted here to a multiple-choice format by randomly sampling three incorrect answers (a conversion sketch is shown below).
These translated datasets are open-sourced and available here. We report zero-shot performance for all models on these tasks, following standard practices in the field.
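The multiple-choice conversion of TriviaQA mentioned above can be done by sampling distractor answers from the rest of the dataset; the sketch below is a simplified illustration rather than the exact script used for this evaluation.

```
# Simplified sketch of converting a generative QA set (e.g. TriviaQA) into
# multiple-choice items by sampling three distractor answers from other questions.
import random

def to_multiple_choice(examples, num_distractors=3, seed=0):
    """examples: list of dicts with 'question' and 'answer' keys."""
    rng = random.Random(seed)
    all_answers = [ex["answer"] for ex in examples]
    mcq_items = []
    for ex in examples:
        candidates = [a for a in all_answers if a != ex["answer"]]
        distractors = rng.sample(candidates, num_distractors)
        options = distractors + [ex["answer"]]
        rng.shuffle(options)
        mcq_items.append({
            "question": ex["question"],
            "options": options,
            "label": options.index(ex["answer"]),
        })
    return mcq_items
```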
Across the four standard benchmarks, Sarvam-1 demonstrates strong performance across languages despite being much smaller than Llama 3.1 8B. While it trails the larger model slightly on English tasks, it consistently outperforms both Gemma 2 2B and Llama 3.2 3B in all evaluations. Most notably, Sarvam-1 achieves exceptional results on TriviaQA, with an average score of 90.62 across Indic languages, significantly surpassing even the larger Llama 3.1 8B (61.47). On MMLU, ARC-Challenge, and BoolQ, it sets a new state of the art with Indic-language averages of 44.44, 58.50, and 80.68, respectively. For a language-wise breakdown, see the Appendix.
IndicGenBench
Additionally, we evaluate on IndicGenBench, a benchmark suite from Google, comprising four datasets:
- CrossSum: Cross-lingual summarization, going from English documents to summaries in target Indic languages.
- Flores: Focused on English to Indic language translation.
- XORQA: A question-answering dataset with English context and questions, requiring answers in the target Indic language.
- XQUAD: A question-answering dataset with both context and questions in Indic languages.
We observe that, while other models show significant performance degradation in zero-shot settings on these tasks, Sarvam-1 maintains consistent performance, resulting in a substantial gap in its favor. However, for consistency with the literature, we report one-shot performance as recommended by the original paper.
On the IndicGenBench suite, Sarvam-1 shows particularly impressive results in translation, achieving an average chrF++ score of 39.83 on Flores English-to-Indic translation and substantially outperforming all baseline models, including Llama 3.1 8B (34.23). The model maintains competitive performance on cross-lingual summarization (CrossSum) with an average chrF++ of 20.48, and demonstrates strong cross-lingual question-answering capabilities on XORQA with an average word-level F1 of 25.27. While its XQUAD result (41.58) is slightly below Llama 3.1 8B (44.04), Sarvam-1 still outperforms both Gemma 2 2B and Llama 3.2 3B, showing its effectiveness on complex multilingual question-answering tasks. For a language-wise breakdown, see the Appendix.
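For reference, chrF++ (a character n-gram F-score augmented with word unigrams and bigrams) can be computed with the sacrebleu library, as sketched below; the sentences are placeholders rather than actual model outputs.

```
# Sketch: computing chrF++ with sacrebleu (word_order=2 gives the "++" variant).
# The hypotheses/references below are placeholders, not actual model outputs.
from sacrebleu.metrics import CHRF

chrf_pp = CHRF(word_order=2)  # chrF++ adds word 1- and 2-grams to character n-grams
hypotheses = ["यह एक उदाहरण वाक्य है।"]
references = [["यह एक उदाहरण वाक्य है।"]]
print(chrf_pp.corpus_score(hypotheses, references).score)
```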
Example Use Case: Translation
To assess the practical utility of Sarvam-1, we conducted extensive evaluations on downstream tasks after fine-tuning. Translation is a particularly illustrative example of the model's capabilities and efficiency. We fine-tuned Sarvam-1 for English-to-Indic translation on the BPCC dataset and evaluated its performance on IN22-Gen. The results demonstrate that Sarvam-1:
- Outperforms comparably sized models in its class
- Achieves BLEU scores (~20) comparable to significantly larger models such as Gemma-2-9B and Llama-3.1-8B (a scoring sketch is shown below)
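As a reference for reproducing such numbers, corpus-level BLEU can be computed with sacrebleu as sketched below; the sentences are placeholders rather than IN22-Gen data, and the exact tokenization settings behind the reported scores are not specified here.

```
# Sketch: corpus-level BLEU with sacrebleu; placeholder sentences, not IN22-Gen data.
from sacrebleu.metrics import BLEU

bleu = BLEU()
hypotheses = ["भारत एक बहुत बड़ा देश है।"]   # model translations
references = [["भारत एक विशाल देश है।"]]     # one reference stream
print(bleu.corpus_score(hypotheses, references).score)
```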
A key advantage of Sarvam-1 is its computational efficiency: it offers 4-6x faster inference than these larger models while maintaining competitive performance. The smaller parameter count also enables cost-effective deployment in production environments.
This combination of strong performance and superior inference efficiency makes Sarvam-1 particularly well-suited for practical applications, including on edge devices. We can’t wait to see what the community builds with Sarvam-1!
Acknowledgements
We extend our sincere gratitude to several organizations and partners whose support was instrumental in the development and training of Sarvam-1:
NVIDIA: We thank NVIDIA for their valuable assistance with the NeMo codebase. Their expertise in large-scale model training frameworks significantly streamlined our development process and enabled efficient utilization of computational resources.
Yotta: Our appreciation goes to Yotta for providing access to their state-of-the-art GPU cluster, Shakti. This high-performance computing infrastructure was crucial for training Sarvam-1 at scale, allowing us to push the boundaries of Indic language model capabilities.
AI4Bharat: We are grateful for our academic partnership with AI4Bharat. Their expertise in Indian language technologies and their contributions to open-source language resources have been invaluable to our research and development efforts.
Appendix
Update (08 Nov, 2024): The results have been updated after annealing and model-merging showed significant improvements in performance (see the Llama technical report, Section 3.1.3, for more details on annealing).