We're thrilled to introduce Shuka v1, a groundbreaking language model that natively understands audio in Indic languages. This innovative encoder-decoder model combines two powerful components: (i) Saaras v1, our state-of-the-art, in-house audio encoder, and (ii) Meta's Llama3-8B-Instruct, which serves as the decoder.
The magic happens in a small projector with approximately 60 million parameters, which bridges the encoder and decoder. During training, we only fine-tune the projector weights, keeping the rest of the network frozen. True to our philosophy of frugal model training, Shuka v1 is trained on less than 100 hours of audio data. You can see what Shuka v1 is capable of in the following demo video.
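To make the training recipe concrete, here is a minimal PyTorch sketch of projector-only fine-tuning. The module names, layer sizes, and optimizer settings are illustrative assumptions, not our actual implementation.

```python
import torch
import torch.nn as nn

class AudioLM(nn.Module):
    """Illustrative encoder -> projector -> decoder wrapper; names are hypothetical."""
    def __init__(self, encoder: nn.Module, projector: nn.Module, decoder: nn.Module):
        super().__init__()
        self.encoder = encoder      # audio encoder (kept frozen)
        self.projector = projector  # small projection network, the only trained part
        self.decoder = decoder      # text decoder (kept frozen)

def freeze_all_but_projector(model: nn.Module) -> None:
    """Leave only the projector's parameters trainable."""
    for name, param in model.named_parameters():
        param.requires_grad = name.startswith("projector")

# Toy stand-ins just to show the effect; the real modules are far larger.
model = AudioLM(nn.Linear(80, 1280), nn.Linear(1280, 4096), nn.Linear(4096, 4096))
freeze_all_but_projector(model)

optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4)
print(sum(p.numel() for p in model.parameters() if p.requires_grad), "trainable params")
```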
Technical Deep Dive
Our Saaras v1 encoder is a powerhouse, trained on a diverse dataset spanning over 10 languages. With 1.5 billion parameters and based on the Whisper architecture, it forms the backbone of Shuka v1's audio understanding capabilities. You can find more information about Saaras here.
Shuka v1 processes audio input by sampling 100 frames per second. The encoder generates representations for these frames, which are then grouped into stacks of 8. Each stack passes through the projector, producing a single "audio token" that's compatible with the decoder's embedding space. This clever approach allows the decoder to interpret each second of audio as approximately 13 "text tokens."
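As a rough illustration of this stack-and-project step, the sketch below groups encoder frames into stacks of 8 and maps each stack to a single audio token. The hidden sizes and the projector architecture are assumptions chosen for illustration, not the model's actual dimensions.

```python
import torch
import torch.nn as nn

# Illustrative sizes, not the actual model's: encoder hidden size, decoder
# embedding size, and the stack factor of 8 described above.
ENC_DIM, DEC_DIM, STACK = 1280, 4096, 8

projector = nn.Sequential(
    nn.Linear(ENC_DIM * STACK, DEC_DIM),
    nn.GELU(),
    nn.Linear(DEC_DIM, DEC_DIM),
)

def frames_to_audio_tokens(frames: torch.Tensor) -> torch.Tensor:
    """frames: (batch, num_frames, ENC_DIM) encoder outputs at ~100 frames/second.
    Returns (batch, num_frames // STACK, DEC_DIM) audio tokens for the decoder."""
    b, t, d = frames.shape
    t = (t // STACK) * STACK                       # drop a trailing partial stack
    stacked = frames[:, :t].reshape(b, t // STACK, d * STACK)
    return projector(stacked)

# One second of audio -> 100 frames -> 100 / 8 = 12.5, which is where the
# ~13 tokens/second figure comes from; this toy version drops the partial stack.
one_second = torch.randn(1, 100, ENC_DIM)
print(frames_to_audio_tokens(one_second).shape)  # torch.Size([1, 12, 4096])
```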
A crucial insight we gained during development was the importance of regenerating answers for our QA datasets using the intended decoder. For Shuka v1, we fed question transcriptions to Llama3 and used its outputs as gold-standard answers in the final training phase.
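A minimal sketch of that regeneration step is shown below, assuming the Hugging Face transformers text-generation pipeline with chat-formatted input; the prompting and generation settings are illustrative rather than our exact data-generation setup.

```python
from transformers import pipeline

# Regenerate gold answers with the same decoder that Shuka v1 uses, so the
# training targets match its style. Settings here are illustrative.
generator = pipeline(
    "text-generation",
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    device_map="auto",
)

def regenerate_answer(question_transcription: str) -> str:
    messages = [{"role": "user", "content": question_transcription}]
    out = generator(messages, max_new_tokens=256, do_sample=False)
    # With chat-formatted input, the pipeline returns the conversation with the
    # assistant's reply appended as the last message.
    return out[0]["generated_text"][-1]["content"]

# Each (audio question, regenerated answer) pair then becomes a training example
# in the final phase, with the regenerated text serving as the gold answer.
```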
Evaluation
Shuka v1 outperforms much larger models, even on languages it wasn't explicitly fine-tuned for. We created a custom audio evaluation dataset featuring 100 samples in each of the 10 Indic languages supported by our encoder. This evaluation set is also open-sourced here so the research community can use and build on it.
Shuka v1 was pitted against two types of competitors: (i) "Direct audio" models: Gemini Flash and Gemini Pro, which process audio inputs and generate text answers, and (ii) "Pipeline" models: Llama3, GPT-4o-mini, and GPT-4o, which first perform ASR on the audio and then answer questions based on the transcribed text. All pipeline models in our comparison used multilingual whisper-large-v3 for ASR.
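For reference, a "pipeline" baseline of this kind can be sketched as follows; the exact prompts and decoding parameters in our evaluation differ, so treat this as an outline of the two-step approach rather than the evaluation harness itself.

```python
from transformers import pipeline

# Two-step "pipeline" baseline: ASR with whisper-large-v3, then answer from text.
asr = pipeline("automatic-speech-recognition", model="openai/whisper-large-v3")
llm = pipeline("text-generation", model="meta-llama/Meta-Llama-3-8B-Instruct")

def answer_from_audio(audio_path: str) -> str:
    transcription = asr(audio_path)["text"]               # step 1: transcribe
    messages = [{"role": "user", "content": transcription}]
    out = llm(messages, max_new_tokens=256)                # step 2: answer from text
    return out[0]["generated_text"][-1]["content"]
```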
Our win-tie-loss analysis revealed that Shuka v1 significantly outperforms all pipeline models across languages, while also generating its first output token faster. When compared to the direct-audio Gemini models, Shuka v1 holds its own overall and excels in Gujarati, Hindi, Kannada, and Marathi.
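For clarity, a win-tie-loss comparison of this kind simply counts, per language, how often Shuka v1's answer is judged better than, equal to, or worse than a baseline's. The records below are made-up placeholders, not actual results.

```python
from collections import Counter

# Placeholder records: each holds the language and a judged verdict for
# Shuka v1 versus one baseline on one evaluation sample.
judgements = [
    {"language": "lang_1", "verdict": "win"},
    {"language": "lang_1", "verdict": "tie"},
    {"language": "lang_2", "verdict": "win"},
    {"language": "lang_2", "verdict": "loss"},
]

def win_tie_loss(records):
    per_language = {}
    for r in records:
        per_language.setdefault(r["language"], Counter())[r["verdict"]] += 1
    return per_language

print(win_tie_loss(judgements))
# {'lang_1': Counter({'win': 1, 'tie': 1}), 'lang_2': Counter({'win': 1, 'loss': 1})}
```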
Customize for your needs
We're excited to announce that we also support fine-tuning of Shuka v1 for customized use cases. This opens up a world of possibilities for businesses looking to tailor our audio language model to their specific needs. Whether you're working on a niche application, targeting specific languages, or aiming to optimize performance for particular domains, our fine-tuning support allows you to leverage the power of Shuka v1 while adapting it to your unique requirements. For more information, please reach out to us at partnerships@sarvam.ai.
Looking Ahead
Shuka v1 represents just the beginning of our journey in audio language understanding for Indic languages. We're releasing this first version of Shuka on Hugging Face to demonstrate what's possible with minimal resources and to inspire the community to build voice-first applications for Indic languages. Our next iteration promises to be even more powerful, trained on a multilingual dataset orders of magnitude larger than what we've used here. Stay tuned for more breakthroughs in audio language models for Indic languages!