August 23, 2024

Giving voice to India's Linguistic Diversity with Bulbul

In the heart of Bangalore, a customer service representative struggles to explain complex banking terms to a Tamil-speaking customer, while in a small town in Gujarat, a patient misunderstands crucial medication instructions due to language barriers. Across India, businesses lose millions in potential revenue and trust every day, not because of what they're saying, but because of how they're saying it.

The culprit? Outdated Text-to-Speech (TTS) technology that fails to capture the essence of how Indians truly communicate.

Imagine a world where:

• Your banking app speaks to you in fluent Hinglish, seamlessly blending Hindi and English just like your local bank teller.

• Healthcare hotlines pronounce medical terms accurately in Bengali, ensuring critical health information is never lost in translation.

• E-commerce platforms describe products in Tamil with the same enthusiasm and nuance as a local shopkeeper.

This isn't a distant future. With the latest advancements in language technology, it is happening now!

Announcing the launch of Bulbul v1 - our best-in-class code-mixed, multi-lingual text-to-speech model. Now available in 10+ languages!

Meet the Voices of Bulbul v1

Bulbul v1 comes with six distinct voices, each designed to cater to a wide range of communication needs across various industries and contexts:

{{bulbul_voices}}

While these distinct voices offer a range of personalities to suit different needs, what's truly revolutionary about Bulbul v1 is its ability to maintain a consistent voice across multiple languages. Imagine Meera explaining complex financial products in Hindi, English, Tamil, and Bengali – all with the same professional tone and personality. This consistency in voice across languages allows businesses to maintain continuity while communicating effectively with diverse linguistic communities.
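As a rough illustration of what "one voice, many languages" could look like from a caller's side, the sketch below builds one synthesis request per language while holding the voice fixed. The `SynthesisRequest` type, its field names, the language tags, and the sample sentences are hypothetical placeholders chosen for this example; they are not Bulbul's actual API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SynthesisRequest:
    """Hypothetical request shape; field names are illustrative only."""
    text: str
    language: str  # assumed BCP 47-style tag, e.g. "hi-IN"
    voice: str     # one of the Bulbul v1 voices, e.g. "Meera"


def requests_for_voice(voice: str, scripts: dict[str, str]) -> list[SynthesisRequest]:
    """Build one request per language, keeping the voice constant."""
    return [
        SynthesisRequest(text=text, language=language, voice=voice)
        for language, text in scripts.items()
    ]


# The same announcement, already written in each target language.
scripts = {
    "en-IN": "Your fixed deposit matures next week.",
    "hi-IN": "Aapka FD agle hafte mature ho raha hai.",
}

batch = requests_for_voice("Meera", scripts)
for request in batch:
    print(request.language, "->", request.voice)
```

Keeping the voice as a single argument, rather than a per-language setting, makes it harder to accidentally switch personas between languages.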

But how did we achieve this level of linguistic dexterity and intelligence? Let's dive into the innovative approach we took in training Bulbul...

How we trained Bulbul

For training Bulbul, we focused on the following aspects:

1. Multilingual Efficiency: We opted for a single, compact model with multilingual capabilities, enabling contextual learning transfer across languages.

2. Indian Context Mastery: Bulbul is trained on diverse vocabulary tailored to the Indian context, excelling at code-mixed language, domain-specific terms, local names, and special entities.

3. Prosody Control: We engineered a pitch- and pace-aware model, allowing for controllable prosody to suit various speech contexts.

Data: Our training data combines high-quality, diverse audio from multiple speakers and languages. We applied strict quality checks and incorporated vocabulary from various domains, including code-mixed inputs, proper nouns, and abbreviations. Voice selection focused on both professional and conversational tones to cover a wide range of use cases.
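To make "code-mixed inputs" concrete, here is a minimal sketch that splits a sentence into runs of Devanagari and Latin script. It only illustrates what code-mixed text looks like to a preprocessing step; it is not a description of how Bulbul's data pipeline works, and the helper names `token_script` and `script_runs` are made up for this example.

```python
import unicodedata
from itertools import groupby


def token_script(token: str) -> str:
    """Label a token by the first recognised script among its characters."""
    for ch in token:
        name = unicodedata.name(ch, "")
        if name.startswith("DEVANAGARI"):
            return "DEVANAGARI"
        if name.startswith("LATIN"):
            return "LATIN"
    return "OTHER"  # digits, punctuation, or scripts not handled here


def script_runs(text: str) -> list[tuple[str, str]]:
    """Group consecutive whitespace-separated tokens that share a script."""
    tokens = text.split()
    return [
        (script, " ".join(group))
        for script, group in groupby(tokens, key=token_script)
    ]


sample = "आज हम Einstein की Theory of Relativity के बारे में पढ़ेंगे"
for script, run in script_runs(sample):
    print(f"{script:<10} {run}")
```

Note that romanized Hindi ("Kya aap ek FD open karna chahenge?") is entirely Latin script, so a script-based split like this cannot detect that kind of code-mixing.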

Model Training: Bulbul is designed for low latency and multilingual capabilities. The architecture enables real-time prosody adjustments and implements cross-lingual transfer learning. This allows voices trained in one language to perform well in others, enhancing the model's versatility across diverse applications.

What can you build with Bulbul?


1. Rich & reliable conversational experiences

In real-life customer-facing scenarios, businesses often need a voice that represents their brand reliably, effectively, and consistently. While text-to-speech technology has rapidly improved at producing human-sounding speech, the conversation has largely overlooked the need for colloquial delivery of the content itself. To truly bridge the gap between brands and their consumers, a TTS system needs to speak the language of its users, pronounce domain-specific terms and entity names accurately, and not trip over special entities like dates, currency symbols, and abbreviations. With Bulbul, as with all our other models, we took an application-first and consumer-first approach so that enterprises can use it reliably across workflows.
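To show why special entities are hard, the sketch below expands rupee amounts into spoken English words using the Indian numbering system (thousand, lakh, crore). It is a small rule-based illustration, English-only, and not how Bulbul handles such entities internally; the function names and the regular expression are assumptions made for this example.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def two_digits(n: int) -> str:
    """Spell out an integer in the range 0..99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + ("-" + ONES[ones] if ones else "")


def number_to_words(n: int) -> str:
    """Spell out a non-negative integer with Indian groupings."""
    if n == 0:
        return "zero"
    parts = []
    crore, n = divmod(n, 10_000_000)
    lakh, n = divmod(n, 100_000)
    thousand, n = divmod(n, 1_000)
    hundred, rest = divmod(n, 100)
    if crore:
        parts.append(number_to_words(crore) + " crore")
    if lakh:
        parts.append(two_digits(lakh) + " lakh")
    if thousand:
        parts.append(two_digits(thousand) + " thousand")
    if hundred:
        parts.append(ONES[hundred] + " hundred")
    if rest:
        parts.append(two_digits(rest))
    return " ".join(parts)


# Matches amounts such as "₹3,999" or "₹10,435.26".
RUPEES = re.compile(r"₹\s?(\d+(?:,\d+)*)(?:\.(\d{1,2}))?")


def expand_rupees(text: str) -> str:
    """Replace each rupee amount in the text with its spoken form."""
    def spoken(match: re.Match) -> str:
        rupees = int(match.group(1).replace(",", ""))
        words = number_to_words(rupees) + " rupees"
        if match.group(2):
            paise = int(match.group(2).ljust(2, "0"))
            if paise:
                words += " and " + number_to_words(paise) + " paise"
        return words
    return RUPEES.sub(spoken, text)


print(expand_rupees("Your account balance is ₹10,435.26."))
# Your account balance is ten thousand four hundred thirty-five rupees and twenty-six paise.
```

Even this narrow case needs grouping rules, decimals, and punctuation handling, and dates, abbreviations, and code-mixed sentences each add their own rules.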

{{bulbul_ecommerce}}

{{bulbul_fintech}}

{{bulbul_healthcare}}

2. Media and Education

In media and education, text-to-speech technology must handle various accents, emotions, and complex narratives while maintaining clarity and engagement for a large and diverse audience.

{{bulbul_audiobooks}}

{{bulbul_elearning}}

3. News and Entertainment

News broadcasting requires clear pronunciation of names, places, acronyms, and abbreviations, while keeping the content engaging. News is also typically delivered at a faster pace. Bulbul allows pace and pitch modulation for all voices across languages, so you can configure and personalise content delivery for your application.
On the other hand, cultural and fun applications require an understanding of regional nuances, appropriate emotional tones, and the ability to handle specialized vocabulary in the language people speak and consume content in.
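One way an application might organise pace and pitch settings is as named presets per content type. The sketch below is only that: the `Prosody` type, the parameter names `pace` and `pitch`, their value ranges, and the preset values are all assumptions for illustration, not Bulbul's documented parameters.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Prosody:
    """Hypothetical prosody settings; names and ranges are assumptions."""
    pace: float = 1.0   # speed multiplier, 1.0 = the voice's default pace
    pitch: float = 0.0  # relative offset, 0.0 = the voice's default pitch

    def __post_init__(self) -> None:
        # Reject values outside the (assumed) supported ranges early.
        if not 0.5 <= self.pace <= 2.0:
            raise ValueError(f"pace must be in [0.5, 2.0], got {self.pace}")
        if not -1.0 <= self.pitch <= 1.0:
            raise ValueError(f"pitch must be in [-1.0, 1.0], got {self.pitch}")


# Illustrative presets: brisk delivery for news, slower and lower for narration.
PRESETS = {
    "news": Prosody(pace=1.2),
    "narration": Prosody(pace=0.9, pitch=-0.1),
}


def build_request(text: str, voice: str, preset: str) -> dict:
    """Merge text, voice, and a named prosody preset into one payload."""
    return {"text": text, "voice": voice, **asdict(PRESETS[preset])}


payload = build_request("ISRO has launched GSAT-30.", "Arvind", "news")
print(payload)
```

Validating the values when a preset is created, rather than at request time, surfaces a bad configuration before any audio is generated.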

{{bulbul_broadcasting}}

{{bulbul_astrology}}

4. Accessibility and Information Services

Accessibility services require clear enunciation, appropriate pacing, and the ability to convey visual information effectively through audio. Pronouncing complex location names, communicating directions, and spelling out numerals clearly lets customers building these applications personalize the experience for India's colloquial audience.

{{bulbul_maps}}

{{bulbul_iot}}


Conclusion

Bulbul v1 represents a significant leap forward in Text-to-Speech technology for India's diverse linguistic landscape. By embracing code-mixing, regional nuances, and domain-specific intelligence, we've created a tool that doesn't just speak to India, but speaks as India. From powering natural customer interactions and delivering engaging content to enabling fun, culturally-relevant applications, Bulbul opens up a world of possibilities for businesses across sectors. Our commitment goes beyond technology – we're dedicated to bridging communication gaps and fostering deeper connections between businesses and the 1.4 billion voices of India. With Bulbul v1, we invite you to join us in transforming how India communicates, one conversation at a time.

-- Draft Elements --
BULBUL

Meera: Professional and articulate
Arvind: Conversational and articulate
Maitryee: Engaging and informational
Amol: Narrational and mature
Pavithra: Dramatic and engaging
Amartya: Expressive and distinct
E-commerce support
E-commerce requires clear communication of order details, prices, and delivery timelines, often mixing English terms with regional languages. Pick a voice for your brand and keep it consistent across all your communications and languages.
TTS Input: "Your order for 2 pairs of Allen Solly jeans and 1 Nike T-shirt has been confirmed. Total price: ₹3,999. Your order will be delivered in 2 days."
Hindi
Kannada
Odia
Telugu
Fintech Applications:
Financial services demand precise pronunciation of monetary values and financial terms, often involving large numbers and specialized vocabulary.
TTS Input: "Your account balance is ₹10,435.26. Kya aap ek FD open karna chahenge?"
Hindi
Punjabi
Tamil
Healthcare Communication:
Healthcare communication requires accurate pronunciation of medical terms, dosages, and instructions, often involving complex terminology and precise numerical information.
TTS Input: "Namaste Sharma ji, Dr. Gupta ne aapko Metformin 500mg prescribe kiya hai. Ise daily two times, subah aur shaam ko khana ke baad lena hai. Kya aapko koi side-effects ka anubhav ho raha hai?"
Hindi
Multilingual Audiobooks:
Audiobooks require consistent voice quality across languages, natural code-mixing, and expressive narration to bring stories to life. Give a unique voice to your characters in the same language.
TTS Input: "भगवान कृष्ण कहते हैं, सुखी जीवन जीने और स्वर्ग प्राप्त करने के लिए तपस्या और दान जैसे कुछ कार्य करने चाहिए। पुण्य कर्म करने से अनजाने में किए गए पाप भी नष्ट हो जाते हैं। इस प्रकार मनुष्य को नरक में नहीं जाना पड़ता।"
Hindi
Bengali
E-Learning Platform
Educational content often involves technical terms, mathematical expressions, and the need to maintain student engagement through varied intonation.
TTS Input:  "आज हम Einstein की Theory of Relativity के बारे में पढ़ेंगे। Theory कहती है कि समय और space एक दूसरे से जुड़े हुए हैं और इन्हें एक साथ space-time कहा जाता है। यह theory बताती है कि जब कोई object बहुत high speed से move करता है, तो उसके लिए time slow हो जाता है। इसे mathematically इस equation से express किया जा सकता है:

E = mc^2

जहाँ E energy है, m object का mass है, और c speed of light in vacuum है, जो लगभग 3 times 10^8 meters per second होती है। यह equation दिखाती है कि mass और energy interchangeable हैं और एक दूसरे में convert हो सकते हैं।"
Hindi
Multilingual news broadcasting
TTS Input (with many abbreviations): "The ISRO (Indian Space Research Organisation) has successfully launched its latest satellite, GSAT-30, from the Satish Dhawan Space Centre. The satellite will enhance communication services across India. This achievement marks another milestone for ISRO following their earlier successful missions this year."
English
Tamil
Astrology Bot
Astrology applications need to convey mystical and predictive content with an appropriate tone and handling of astrological terminology.
TTS Input: "Namaste! Aaj aapka din shubh hai. Venus ki position se aapko aaj ek good news mil sakti hai. Office mein kisi senior se important task assign ho sakta hai. Stay confident!"
Hindi
Gujarati
Giving a Desi Touch to Google Maps:
Navigation services need to provide clear, timely instructions with accurate pronunciation of street names and landmarks.
TTS Input: "Head south on Netaji Subhash Marg toward Dayanand Road. In 12 meters, turn left onto Dayanand Road. Continue straight for 350 meters, passing the United Bank of India ATM on your left."
Hindi
Speak to your users via IoT
Smart home devices need to convey information clearly and handle queries in natural, conversational language.
TTS Input: "Good morning! It's 7:00 AM. The temperature today is 28 degrees Celsius, and the weather is very pleasant. You have a busy day ahead. Your first meeting is scheduled for 9:30 AM with the marketing team to discuss the upcoming campaign strategies."
Marathi
Legal Documents
The powers of judicial review in the matters involving financial implications are also very limited. The wisdom and advisability of the Courts in the matters concerning the finance, are ordinarily not amenable to judicial review unless a gross case of arbitrariness or unfairness is established by the aggrieved party.
Key Feature: With Formal Mode, you can create legal documents in different Indic languages while maintaining the formal tone.

Colloquial Mode now empowers millions of Indians to access these complex documents by translating them into colloquial Indic languages.
Other Translation Models
‍वित्तीय निहितार्थ से जुड़े मामलों में न्यायिक समीक्षा की शक्तियाँ भी बहुत सीमित हैं। वित्त से संबंधित मामलों में न्यायालयों का ज्ञान और सलाह आम तौर पर न्यायिक समीक्षा के लिए अनुकूल नहीं होते हैं जब तक कि पीड़ित पक्ष द्वारा मनमाने या अन्यायपूर्ण का एक गंभीर मामला स्थापित नहीं किया जाता है।​

Mayura (Formal + Preprocessing)
वित्त-संबंधी मामलों की समीक्षा करने के लिए न्यायपालिका की शक्ति काफी सीमित है। आम तौर पर, अदालतें वित्तीय मामलों में हस्तक्षेप नहीं करती हैं जब तक कि अन्याय या मनमाने ढंग से काम करने का स्पष्ट मामला न हो। यह आम तौर पर केवल तभी होता है जब निर्णय से प्रभावित व्यक्ति इसे साबित कर सकता है।​

Mayura (Colloquial + Preprocessing)
Judiciary की financial-related cases को review करने की power बहुत restricted है। आमतौर पर, courts financial matters में interfere नहीं करते हैं जब तक कि unfairness या arbitrariness का clear case ना हो। ये आमतौर पर तभी होता है जब decision से प्रभावित व्यक्ति उसे prove कर सके।​
Unlock colloquial translation
I can help you sign up for our courses in just a few steps. Can you please provide your name and email address to get started?​


She's the GOAT when it comes to baking.
Formal
मैं कुछ ही चरणों में हमारे पाठ्यक्रमों के लिए साइन अप करने में आपकी मदद कर सकता हूँ। क्या आप कृपया अपना नाम और ईमेल पता प्रदान कर सकते हैं?

Colloquial
मैं आपको बस कुछ ही steps में हमारे courses के लिए sign up करने में मदद कर सकता हूँ। क्या आप अपना नाम और email address बता सकते हैं ताकि हम शुरू कर सकें?​

Other Models
जब बेकिंग की बात आती है तो वह बकरी है।

Colloquial Mode:
वे बेकिंग में महारत रखती हैं, उनके केक शानदार होते हैं।​

Visual
E-commerce requires clear communication of order details, prices, and delivery timelines, often mixing English terms with regional languages.
TTS Input: "Your order for 2 pairs of Allen Solly jeans and 1 Nike T-shirt has been confirmed. Total price: ₹3,999. Your order will be delivered in 2 days"
Hindi
Kannada
Healthcare Communication:
Healthcare communication requires accurate pronunciation of medical terms, dosages, and instructions, often involving complex terminology and precise numerical information.
TTS Input: "Namaste Sharma ji, Dr. Gupta ne aapko Metformin 500mg prescribe kiya hai. Ise daily two times, subah aur shaam ko khana ke baad lena hai. Kya aapko koi side-effects ka anubhav ho raha hai?"
Hindi
Gujarati
Multilingual Audiobooks:
Audiobooks require consistent voice quality across languages, natural code-mixing, and expressive narration to bring stories to life. Give a unique voice to your characters in the same language.
TTS Input:
कृष्ण: "अर्जुन, धर्म का मार्ग अक्सर चुनौतियों से भरा होता है, लेकिन विश्वास और संकल्प के साथ, सबसे अंधेरी रातें भी सुबह में बदल जाती हैं।"

अर्जुन: "कृष्ण, आपका ज्ञान हमारा मार्गदर्शक तारा है। मैं धर्म की रक्षा करने और अपने लोगों की रक्षा करने का प्रयास करूंगा।"

द्रौपदी: "कृष्ण, मेरा हृदय अन्याय के बोझ से भारी है, लेकिन आपकी उपस्थिति मुझे आशा से भर देती है। मुझे विश्वास है कि न्याय की जीत होगी।"
Krishna
Arjun
Draupadi
Male Professional newscaster voice in English:
TTS Input:  "The ISRO (Indian Space Research Organisation) has successfully launched its latest satellite, GSAT-30, from the Satish Dhawan Space Centre. The satellite will enhance communication services across India. This achievement marks another milestone for ISRO following their earlier successful missions this year."
TTS Output
Hindi (Female voice):
TTS Input:  "इसरो, Indian Space Research Organisation ने अपना latest satellite, GSAT-30, Satish Dhawan Space Centre से, successfully launch कर दिया है। , ये satellite पूरे India में, communication services को improve करेगा। , ये इस साल ISRO के successful missions के बाद , एक और बड़ी achievement है।"
TTS Output
Tamil (Male voice):
Phase | Phase 1 | Phase 2 | Phase 3
Input | English audio (sentences) | English + Hindi audio (sentences) | English + Hindi audio (questions)
Output | Transcriptions | English -> Transcriptions. Hindi -> Transcriptions translated to English | Answers in English
Hours of audio | 35 | 100 | 30
LR schedule | Constant with warmup | Cosine decay | Cosine decay with warmup
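
For readers unfamiliar with the three learning-rate schedules named above, the sketch below shows one common way each is defined. The peak rate, step counts, and the choice of linear warmup are assumptions for illustration; the source does not state the actual hyperparameters used.

```python
import math


def constant_with_warmup(step: int, peak_lr: float, warmup_steps: int) -> float:
    """Ramp linearly from 0 to peak_lr, then hold peak_lr."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr


def cosine_decay(step: int, peak_lr: float, total_steps: int,
                 min_lr: float = 0.0) -> float:
    """Decay from peak_lr to min_lr along half a cosine wave."""
    progress = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))


def cosine_decay_with_warmup(step: int, peak_lr: float, warmup_steps: int,
                             total_steps: int, min_lr: float = 0.0) -> float:
    """Linear warmup to peak_lr, then cosine decay over the remaining steps."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    decay_steps = total_steps - warmup_steps
    return cosine_decay(step - warmup_steps, peak_lr, decay_steps, min_lr)


# Hypothetical settings, chosen only to print a few sample points.
PEAK_LR, WARMUP, TOTAL = 1e-3, 100, 1000
for step in (0, 50, 100, 550, 1000):
    print(step,
          constant_with_warmup(step, PEAK_LR, WARMUP),
          cosine_decay(step, PEAK_LR, TOTAL),
          cosine_decay_with_warmup(step, PEAK_LR, WARMUP, TOTAL))
```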