I just tried a new text-to-speech AI tool that clones your voice in seconds
OpenVoice is a new text-to-speech artificial intelligence technology that can clone any voice from a 30-second sample. And it keeps the tonal quality of that original voice as it turns your written text into spoken word audio.
Text-to-speech made it onto my list of the most important AI tools of last year. This is a new take on that approach, speeding up the time it takes to copy a voice.
While it was able to create a clone of my voice almost instantly, the output made me sound American rather than English, which is my native accent. It does, however, do a very good job if you start with a neutral American accent.
One of the example clips uses a sample of Elon Musk speaking as its reference. When you type in random text for his cloned voice to repeat, the sounds are softer, less South African and more Southern California. You can hear this for yourself further down the article.
How does OpenVoice work?
The multilingual OpenVoice from MyShell has been trained on hours of voice samples. This allows it to identify patterns and speed up the time required to clone a new voice.
It can replicate the tone color of the reference speaker and, unlike other tools such as ElevenLabs, gives the user control over emotion, accent, rhythm, pauses and intonation.
OpenVoice has already been in use to provide voice cloning for the MyShell AI tool since May, used by tens of millions of users around the world to create personal AI chatbots.
How does OpenVoice sound?
I have only tried it through the demos on Lepton and HuggingFace, so it isn’t a true trial as that would require installing and running it on my own machine. However, from that short sample the emotion changing works very well, as does cloning US-based voices.
It struggles with strong accents, although that could be due to the limitations of the demo rather than the model as a whole. However, the samples provided on the project website also seem to focus heavily on US accents.
What makes OpenVoice stand out?
So far, the gold standard in voice cloning from a short sample, with accurate-sounding results, is ElevenLabs. The company also allows speech-to-speech to improve realism. However, it is a commercial and somewhat expensive option for experimenters and hobbyists.
OpenVoice is available to install and run locally. It is also capable of greater degrees of realism, or at least more animation in the generated voice. This could be invaluable for someone who is making a cartoon or radio play as a school project and can’t afford actors.
The more realistic voice AI gets, particularly when a voice can be cloned in seconds, the more actors' unions will be on alert. The recent SAG-AFTRA strike was in part about the use of AI to deprive creatives of work.
I think we’ll see a push to copyright more aspects of an identity, including vocal tone, motion and performance, as AI increasingly replicates those factors.