Amazon unveils BASE TTS, a text-to-speech model showing emergent abilities

BASE TTS Model

Named Big Adaptive Streamable TTS with Emergent abilities, or BASE TTS, the model was trained on 100,000 hours of public domain speech, predominantly in English with smaller portions in German, Dutch, and Spanish. The largest version, BASE-large, has 980 million parameters, making it the largest text-to-speech model to date, according to the researchers.

The researchers also trained smaller versions of the model, with 400 million and 150 million parameters, on varying amounts of training data. Surprisingly, it was the medium-sized, 400-million-parameter model that showed the hoped-for leap in capability, displaying emergent abilities not just in speech quality but in handling challenging tasks that weren’t explicitly part of its training.

The researchers conducted tests involving compound nouns, emotions, foreign words, paralinguistics, punctuations, questions, and syntactic complexities. BASE TTS excelled in handling these challenges, outperforming other models like Tortoise and VALL-E. Examples of difficult texts spoken naturally by the model are available on the researchers’ website.

BASE TTS Samples

None of the speaker identities synthesized here were seen during training; the model recreates each one from a single reference utterance. Examples for each ability are available in English and Spanish.

These samples are licensed under the Creative Commons Attribution-NonCommercial (CC BY-NC) license.

Compound nouns: Every morning, I make my favorite breakfast sandwich: avocado, egg, and cheese on a bagel. Once, the toaster oven malfunctioned, so I resorted to the stovetop frying pan.
Emotions: With a gentle touch and a loving smile, she reassured, “Don’t worry, my love. We’ll get through this together, just like we always have. I love you.”
Foreign words: Lasso’s novella, rich in allegory and imbued with a sense of ennui, drew from his experiences living in a French château up near the border.
Paralinguistics (i.e. readable non-words): David whispered to Emily as the lights dimmed in the theater, “Shh, it’s starting.”
Punctuations: Jackson stumbled over his words, clearly nervous… well… it’s not like it really matters now, does it?
Questions: And over there in that seemingly normal puritan household in New England, are those children flying around the room?
Syntactic complexities: In the classroom, filled with the chatter of students sharing their holiday stories and the rustling of new textbooks, Mrs. Thompson, excited to embark on a new academic year, prepared a lesson that would challenge and inspire her students.

The success of BASE TTS is attributed to both its size and its extensive training data. While it still faces challenges, it outperforms existing models on the complexities that typically trip up text-to-speech engines. BASE TTS remains an experimental model, however, and further research is needed to identify the inflection point at which emergent abilities appear and to improve the training and deployment process.


One notable feature of BASE TTS is its “streamable” nature, allowing it to generate speech moment by moment at a relatively low bitrate. Additionally, the researchers have explored packaging speech metadata, such as emotionality and prosody, in a separate, low-bandwidth stream to accompany the audio.
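To make the idea of "streamable" generation concrete, here is a minimal Python sketch of a synthesizer that hands audio to the listener chunk by chunk instead of waiting for the whole utterance. It is not BASE TTS's actual method: the function names, the 16 kHz sample rate, the 100 ms chunk size, and the one-chunk-per-word rule are all assumptions made for illustration, and the "decoder" simply returns silence.

```python
from typing import Iterator, List

SAMPLE_RATE = 16_000  # assumed sample rate in Hz, for illustration only
CHUNK_MS = 100        # assumed chunk length in milliseconds


def fake_decode_chunk(text_piece: str, n_samples: int) -> List[float]:
    """Hypothetical stand-in for a TTS decoder: returns silent samples."""
    return [0.0] * n_samples


def stream_speech(text: str, chunk_ms: int = CHUNK_MS) -> Iterator[List[float]]:
    """Yield audio one chunk at a time instead of returning it all at once.

    One chunk per word is an arbitrary simplification; a real system
    would decide chunk boundaries from its own acoustic model.
    """
    samples_per_chunk = SAMPLE_RATE * chunk_ms // 1000
    for word in text.split():
        # Each chunk is handed to the caller as soon as it is ready.
        yield fake_decode_chunk(word, samples_per_chunk)


if __name__ == "__main__":
    for i, chunk in enumerate(stream_speech("Shh, it's starting.")):
        print(f"chunk {i}: {len(chunk)} samples")
```

Because `stream_speech` is a generator, a player could begin playback after the first chunk arrives, which is the practical benefit of streaming over synthesizing the full sentence first.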

This potential breakthrough in text-to-speech arrives in 2024 with implications for accessibility and broader applications. However, the researchers have chosen not to publish the model’s source code or other data, citing concerns about misuse by bad actors and emphasizing the need for responsible deployment of advanced AI technology.