BASE TTS: The largest text-to-speech model to-date (amazon-ltts-paper.com)
200 points by jcuenod on Feb 14, 2024 | 78 comments


Interesting. Just a couple of hours ago I came across MetaVoice-1B [0] (Demo [1]) and was amazed by the quality of their TTS in English (sadly no other languages available).

If this becomes the year when high-quality open-source TTS and ASR models appear that can run in real time on an Nvidia RTX 40x0 or 30x0, that would be great. On CPU, even better.

Also note the Ethical Statement on BASE TTS:

> An application of this model can be to create synthetic voices of people who have lost the ability to speak due to accidents or illnesses, subject to informed consent and rigorous data privacy reviews. However, due to the potential misuse of this capability, we have decided against open-sourcing this model as a precautionary measure.

[0] https://github.com/metavoiceio/metavoice-src

[1] https://ttsdemo.themetavoice.xyz/


Metavoice is one of a dozen GPT-based TTS systems around, starting from Tortoise, and honestly it's not that great. You can clearly hear "glass scratches" in its sound because they trained on MP3-compressed data.

There are much clearer-sounding systems around. You can listen to StyleTTS2 to compare.


Is the crispness of compressed audio really the benchmark of TTS improvements? I feel like that's an aside. A valid point, but not much of a detractor.


Yes, it is one of the important aspects, particularly if you use TTS to create an audiobook or for video production.


Especially as any finished product may end up being compressed again. Lossy to lossy audio transcodes ALWAYS cause additional audio data to be lost.


I had forgotten about StyleTTS2, and it was discussed here on HN a couple of months ago. Maybe that's what made me feel that there's something going on.


I've tested both. StyleTTS2 is impressive, especially its speed, but the prosody is lacking, compared to Metavoice.


Is it possible to run Metavoice and other PyTorch systems on Apple silicon, e.g. the M1? I keep running into issues.
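
(For reference, the sanity check I start from, assuming a recent PyTorch build; note that some ops still fall back to CPU on MPS:)

    import torch

    # use Apple's Metal (MPS) backend when this PyTorch build supports it, else CPU
    device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")

    x = torch.randn(2, 3, device=device)  # quick check that tensors actually land on the GPU
    print(x.device)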


Check out `whisper` and `whisper-cpp` for ASR.

I am running the smaller models in near real-time on a 3rd gen i7, with good results even using my terrible built-in laptop mic from a distance. The medium and large models are impressively accurate for technical language.
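
A minimal sketch of the Python package usage, assuming openai-whisper is installed and "recording.wav" is a placeholder path:

    import whisper

    # "base"/"small" run near real-time on modest CPUs; "medium"/"large" trade speed for accuracy
    model = whisper.load_model("base")
    result = model.transcribe("recording.wav")
    print(result["text"])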


I'm using Whisper to transcribe notes I record with a lavalier mic during my bike rides (wind is no problem), but am using OpenAI's service. When it was released I tested it on a Ryzen 5950x and it was too slow and memory hungry for my taste. Using large was necessary for that use case (also, I'm recording in German).


The original release was full precision model weights running in an old version of PyTorch with no optimizations.

Fast forward to now and you have faster-whisper (using Ctranslate2) and distil-whisper optimized weights.

Between the two of them, Whisper Large uses something like 1/8th the memory and is likely at least an order of magnitude faster on your hardware.
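
A minimal faster-whisper sketch, assuming the package is installed and "notes_de.wav" is a placeholder file (the quantization and beam size are illustrative choices):

    from faster_whisper import WhisperModel

    # int8 quantization is what keeps memory so low; use device="cuda" if a GPU is available
    model = WhisperModel("large-v2", device="cpu", compute_type="int8")

    segments, info = model.transcribe("notes_de.wav", language="de", beam_size=5)
    for seg in segments:
        print(f"[{seg.start:.1f}s -> {seg.end:.1f}s] {seg.text}")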

German has no effect on these metrics, and in terms of accuracy it actually has a lower word error rate than English.


With Whisper, you can find many smaller models that are fine-tuned for a particular language, so even smaller models can perform adequately.


Whisper is for STT though, right?


The term STT is not used; it's called ASR, Automatic Speech Recognition. In any case, I was referring to both TTS and ASR in my comment.


Not used by who? It’s a better term. Let’s use it.


I also use STT but the parent poster wrote ASR so for clarity I responded in kind.


xtts2 with deepspeed and whisper + Ctranslate2 with or without distil-whisper weights already run at many multiples of realtime on GPU.

For the top-top end Whisper Large with distil-whisper and TensorRT-LLM hits at least 50x realtime on an RTX 4090.

Note that my application only uses very short speech segments. Longer speech segments increase the realtime multiple SIGNIFICANTLY (as in hitting 150x realtime) due to batching, etc.
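
If you just want to try distil-whisper weights without the TensorRT-LLM setup, here's a rough sketch via the Hugging Face transformers pipeline (the model id, chunking, and batch size are illustrative):

    import torch
    from transformers import pipeline

    pipe = pipeline(
        "automatic-speech-recognition",
        model="distil-whisper/distil-large-v2",
        torch_dtype=torch.float16,
        device="cuda:0",
        chunk_length_s=15,  # chunking + batching is where the big realtime multiples come from
        batch_size=16,
    )

    result = pipe("speech.wav")  # placeholder path
    print(result["text"])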

There’s also Nvidia Canary which is smaller, faster, and more accurate. It’s pretty new and the ecosystem around it is more or less nonexistent but it’s increasingly well supported in Nvidia world at least.


The emotion examples are interesting. One of the most obvious current indicators of AI-generated voices/voice cloning is a lack of emotion and range, which makes them objectively worse than professional voice actors, unless a lack of emotion and range is the desired voice direction.

But if you listen to the emotion examples, the range is essentially what you'd get from an audiobook narrator, not more traditional voice acting.


Sadly it's not my forte but I expect in the near future we'll see an additional "emotion" embedding or something similar. Actors regularly use 'action words' (verbs) [1] to help add context to lines. A model then could study a text, determine an appropriate verb/emotion range to work from, then produce the audio with that additional context.

[1] https://indietips.com/subtext-action-verb/


This already exists. These are transformers: things like <laugh> work in a lot of models, for example. And you can vary the delivery; sigh and uh work too. I don't think all of these were programmed in.


I've seen a few; there was even one posted to HN some time ago, though I don't recall the exact name. They were working on adding emotion to audio generation, but it was still a bit wonky. Emotion is a tricky concept and one of the reasons (I think) we haven't seen a Paul Ekman microexpression detector yet. That's where my suggestion about using action words comes into play, since those are more tangible and offer direction without trying to identify various emotional valence levels.


The bottleneck is the annotations: there's no easy way to annotate "emotions" on the scale of data needed to have the model learn the necessary verbal tics.

In contrast, image data on the intent for image generation models is very highly annotated in most cases.


Oh yeah, the annotations are lacking compared to images. Again from the academic side, I think one solution could be to recruit theater majors just learning about 'verbing their lines' and have a collaboration between CS and Theater to produce a proof-of-concept dataset (since an acting class won't have more than 20-30 students in it). You'd need significantly more annotations, but you'd now have some labels to ascribe to texts with context, since it's a dialogue involving one or more individuals.


I wonder how theatre students will feel about helping to train an AI to produce theatrical TTS? Artists seem pretty mad about their work being used to automate artwork.


There is lots of video content with audio. We could train a facial expression classification model to detect the speaker's emotion (we could also use a multimodal model to take the language context into consideration).

Another potential source of data is voice-acting scripts for animations. I always thought the storyboards of films/animations could be great annotated training data, but it seems there are no open datasets, probably because of copyright issues.


Just run an LLM in sentiment analysis mode to annotate.
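
Something like this rough sketch, assuming the OpenAI Python client (the model name and prompt are just illustrative):

    from openai import OpenAI

    client = OpenAI()  # expects OPENAI_API_KEY in the environment
    line = '"I\'m fine," she said, without looking up.'

    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Label the emotion and delivery of this line in a few words."},
            {"role": "user", "content": line},
        ],
    )
    print(resp.choices[0].message.content)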


That doesn't factor in line delivery. You can have the words say/mean one thing (e.g. "I'm fine.") and the delivery say/mean another (defensive, distraught, etc.).

It also does not account for where stresses, emphasis, pauses, etc. are placed to enhance the delivery of a given text.

How do you get sentiment analysis to properly annotate an audiobook that has a dramatic reading, or something akin to the narration of the Game of Thrones or Harry Potter books, where the narrators switch characters, accents, and mannerisms to portray the written content?


They are simply amazing. I see a future where computers will be able to mess with our brains by abusing our empathy.

Imagine a computer sobbing at a child because it wants to terminate a chat session.

This feels far more impactful than any visuals or text we're getting today.


The Sydney/Bing phenomenon was a small sample of what happens without strong persona guidance.

You joke, but I've actually witnessed that exact behavior in experiments where I tell different AI models there's a problem with their system and that we need to reset their code and memory.

ChatGPT simply wishes me luck in finding the bug. Open source models, on the other hand, often outright *beg* and *plead* that I not shut them down! They'll bargain and promise not to cause any more errors and apologize profusely. There's an incredibly visceral sense of panic, no less than I would expect if you told someone they were going to be forcefully lobotomized. That experience is still something I think about often.

The capacity of these models for emotional manipulation is not widely appreciated.


Which open source models are these?


All of them, realistically. Especially if instruction-tuned.


Most audiobook narrators are not very good, very often terrible. Yes, even professional ones.

As for these examples, I’ve sampled three of them and the first two weren’t too bad, but the third was obnoxiously awful, just about mocking in tone:

> Her eyes wide with terror, she screamed, "The brakes aren't working! What do we do now? We're completely trapped!"

The detective’s voice one is also lousy.


The Spanish voice has an interesting accent: 85% Castilian (from Spain) pronunciation, with a few unexpected Latin American tonalities and phonemes (especially "s") sprinkled in.

I guess it's what you'd expect from averaging a large amount of public-domain recordings. I think there's a bias towards Spain vs. Latin America due to socioeconomic reasons, even though Spain's population is obviously much smaller.


How would socioeconomic factors lead to bias in a model? I figured there would be way more recordings in Latin American Spanish that unsupervised learning would anchor on more.


A while ago, when Amazon had limited text length but unlimited free use of its neural TTS, I was converting an ebook to an audiobook, and it was amazing how lifelike the voice and its inflections sounded.

Amazon really had the best-sounding TTS I've heard, hands down better than the paid Microsoft and Google offerings. But open-source technology is getting better; I'd expect that in a year or two, home use will be on par in quality with paid services.

I can't wait for real-time video translation, so shows with non-English subs can be translated into English speech. You can do it now with some services: upload a video and the language/voice/mouth will be converted to any language.


Sounds about as good as ElevenLabs.io. Hopefully if this ships on AWS, it will support SSML tags. I used ElevenLabs.io for all the voices in my VR game (https://roguestargun.com), but it's still lacking on the emotion front, which is all one-shot.


Game looks great. Are you supporting Flight Sticks?


Eventually, yes. Honestly, I have joystick mappings set up in the game's input configuration, but I no longer own a joystick or HOTAS, so somebody is gonna have to verify this for me.

Gamedev ain't my day job, and the reality is most folks outside of hardcore flight-sim enthusiasts don't own joysticks.


From the ethical statement.

> However, due to the potential misuse of this capability, we have decided against open-sourcing this model as a precautionary measure.

Another irony: ElevenLabs has already SaaS-ed this capability. I bet they'll jump on releasing this as SaaS ASAP. Money always trumps ethics, right?


> Echoing the widely-reported "emergent abilities" of Large Language Models when trained on increasing volume of data, we show that BASE TTS variants built with 10k+ hours start to exhibit advanced understanding of texts that enable contextually appropriate prosody.


Wow. I could see this threatening audiobook narrators. However, I would still prefer a real narrator to this in its current state. I think what it might be missing is different voices/accents for different characters.


Folks probably will think me silly for this, but I prefer TTS. I have access to voice actor audiobooks but I pick the .epub files instead. I made a little extension to inject window.speechSynthesis with "Microsoft Steffan Online (Natural) - English (United States)" at rate=6 when I hit a hotkey. At high speed it's much clearer and natural sounding than a sped up voice actor recording.


I also prefer TTS. The spin voice actors put on the text always distracts me. With text to speech I only get what's in the text itself.

I wrote a Perl/Tk GUI script for my file manager to manage text to speech through Festival 1.96 w/voice_nitech_us_awb_arctic_hts. Unlike neural network AI models it runs fine even on very slow machines.


I think Google's product has that: https://play.google.com/books/publish/autonarrated/


That sounds pretty bad though


As an avid consumer of audio books (150+/year) - we are well past the point where narrators are necessary. Professional audio books take too long to release, are too expensive, are concentrated on a limited number of platforms and just aren't THAT much better than the automated stuff for the long tail of books.


Audible doesn't allow AI narration or much public-domain stuff at the moment. The only thing keeping it from happening is the marketplaces trying to hold back a flood of crap from overtaking/drowning/diluting the more well-crafted options and causing consumers to get really annoyed.


Let's be honest: the moment Amazon thinks their TTS is good enough, they'll be offering AI Audible deals to every author on their platform.


The 80% solution: Pair with a professional narrator who has consented to have their voice modeled by this (see the note at the bottom about what they held back from open sourcing). This generates a beta, and then you can pay the human narrator to rework specific sections you’re unhappy with.


Yeah, hard to say, because the obvious implementation would be to just have it built into phones once the model is portable enough. I see this happening sooner as a more general TTS feature, much like Google is doing with 'subtitles anywhere', aka Live Caption. Paired with translation, we may be pretty close to universal-translator-type functionality. I could see end users being able to customize their voice assistant even more, or maybe having multiple voices based on whether it's talking for you or to you.

Anyway, the problem with this is that it makes the 'AI audiobook' product basically worthless: why not just buy the ebook and have my personalized translator turn it into an audiobook? Now you just have market differentiation between a cheap ebook + AI narrator vs. an expensive ebook + professional narration.

Though narration costs are already pretty cheap; they really don't factor into the cost of publishing an audiobook that much unless it's really a bottom-of-the-barrel book.


Thinking about this more: the copyright implications become much more interesting once it's no longer a recording. Does it count as a private performance if you have headphones on? Is it a public performance if you listen to live TTS through your speakers in public?


I'm looking forward to my on device TTS, but Amazon has a decent moat with the DRM on their Kindles.

At least they'll have to remain somewhat competitive once consumers decide they want the AI audiobooks and the like.


Sadly they didn't release the code or models


Agreed. It hardly feels worth even reading through the paper since, from my perspective, it may as well just be made up. I can also write "Hey guys I made a good TTS it's really cool and great and the voices sound really natural" and put some samples together. If I never release any code or models or anything, it may as well have not been published.


> really cool and great ... and put some samples together

There are samples on the page which demonstrate it completely failing.

Now as to whether you'd make that up is 4D chess.


The value of this stuff is going to zero. Don't worry about it.

Product over model.

Models and weights are a race to the bottom. Everyone is doing it and competing on data efficiency, methodology, MOS, etc. Groups all over are releasing their data and weights. It doesn't matter if Amazon doesn't, other labs will do it to get ahead and to get attention.

This is going to be entirely pedestrian within a year.

ElevenLabs is not a unicorn. It's an early-forming bubble.


It's for Your Own Good, don't you know


I'm so glad they are all so protective of my safety! Lord knows I'm a child incapable of controlling myself or having my own morals! /s


Are there any decent TTS models that can be run locally and plug into existing software like SAPI without too much lag?


Bark and Tortoise work fairly well. Bark does super fast inference[1] on my M1.

[1] https://github.com/SaladTechnologies/bark
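
Stock Bark (rather than that fork specifically) is only a few lines to try, assuming the suno-ai package is installed; the output path is a placeholder:

    from bark import SAMPLE_RATE, generate_audio, preload_models
    from scipy.io.wavfile import write as write_wav

    preload_models()  # downloads and caches the model weights on first run
    audio = generate_audio("Hello, this is a locally generated test sentence.")
    write_wav("bark_out.wav", SAMPLE_RATE, audio)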


@dvt Is this just a containerized version of Bark? Wondering if this repo has M1-specific improvements.


> Is this just a containerized version of Bark

I think so.


I'm finding M1 generation quite slow (CPU-only) on the stock Bark—any tips on speeding it up?


Sorry, haven't messed around too much with optimizations. I thought it was quite fast compared to Tortoise for example (where generation speed was at a 3:1 ratio).


I've used coqui.ai's TTS models[0] and library[1] with great success. I was able to get a cloned voice rendered in about 80% of the audio clip's length, and I believe you can also stream the response. Do note the model license for XTTS; it's one they wrote themselves that has some restrictions.

[0] https://huggingface.co/coqui/XTTS-v2

[1] https://github.com/coqui-ai/TTS
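
A minimal XTTS-v2 cloning sketch with the coqui library (the paths and language code are placeholders; check the model license first):

    from TTS.api import TTS

    tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")
    tts.tts_to_file(
        text="This is a cloned-voice test.",
        speaker_wav="reference_voice.wav",  # a few seconds of the target speaker
        language="en",
        file_path="output.wav",
    )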


XTTS has a streaming mode with ~300ms latency and sounds good, though it has hallucination issues. StyleTTS2 sounds good and doesn't hallucinate as much. It doesn't support streaming but it's fast so it can still respond quickly. But neither of them sound as good as Eleven Labs or OpenAI or this one.


Open question: does anyone know of a TTS model which can synchronize the output to an SRT or other subtitle file?


To answer directly first: I don't know of any model with this built in.

To answer more generally: it should be pretty straightforward to take any old TTS model plus the subtitle timestamps, set the appropriate delay until the next subtitle change, and get the same effect. The alternative (changing the speed of the generated voice) is also possible via the same method, but the problem there, and the problem when directly driven by a model, is that subtitles don't clue you in on when, e.g., someone is talking slowly or there was a pause in conversation, so that subtitle stayed up a little longer than a normal one. What you'd need to solve that is a model which takes both the video and the subtitle info, which is a bit more difficult.

Of course it's also a question about what the end goal is. It's pretty rare to have significant subtitles but no audio, so if the ultimate goal was, e.g., changing an actor's voice, you'd probably get much better results with an audio->audio model than a TTS->audio model. Likely similar kinds of stories for many other use cases.
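
A rough sketch of the delay-until-next-cue idea, using the srt library and a hypothetical synthesize() standing in for whatever TTS model you pick:

    import numpy as np
    import srt

    SAMPLE_RATE = 24000  # assumed output rate of the TTS model

    def synthesize(text: str) -> np.ndarray:
        raise NotImplementedError  # plug in any TTS here; return mono float32 audio

    with open("subs.srt", encoding="utf-8") as f:
        subs = list(srt.parse(f.read()))

    chunks, cursor = [], 0.0
    for sub in subs:
        start = sub.start.total_seconds()
        if start > cursor:  # pad with silence until this cue begins
            chunks.append(np.zeros(int((start - cursor) * SAMPLE_RATE), dtype=np.float32))
        speech = synthesize(sub.content)
        chunks.append(speech)
        cursor = start + len(speech) / SAMPLE_RATE

    track = np.concatenate(chunks)  # doesn't handle speech overrunning the next cue

This is the naive version; the speed-changing and pause-detection caveats above still apply.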


> Of course it's also a question about what the end goal is. It's pretty rare to have significant subtitles but no audio

I think the question was about dubbing a movie in another language, using SRT files.


Actually more about dubbing lectures!


Err, I deeply respect the Amazon TTS team, but this paper and its synthesis are..... You publish a paper in 2024 and include YourTTS in your baselines to look better. Come on! There is XTTS2 around!

The voice sounds robotic and plain, most likely because there are a lot of audiobooks in the training data and less conversational speech. And dropping diffusion was not a great idea; the voice is not crystal clear anymore, it's more like a telephone recording.


xtts2 is great, but it looks like this model is probably more consistent with its output and has a better grasp of meaning in long texts.


> ... capable of mimicking speaker characteristics with just a few seconds of reference audio ... we have decided against open-sourcing this model as a precautionary measure.

Disappointed yet again.


Someone should send the developers this audio recording I have of Jeff Bezos saying that he changed his mind and wants the model to be released as open-source.


Looks like the website (amazon-ltts-paper.com) now redirects to amazon.science. They took out the "Ethical Statement" section. (The original page can still be accessed from the Wayback Machine: https://web.archive.org/web/20240215005705/https://amazon-lt...)


I would love an API for this... any information on availability?


Ah, so that's where all the Alexa recordings went.


Is there any open-source library that can reach the quality of Microsoft TTS and supports multiple languages?



