New Microsoft VALL-E AI Can Replicate A Voice With 3s Sample

January 20, 2023 | Automation
Microsoft

A team at Microsoft recently released a paper describing a language modeling approach to text-to-speech synthesis. They trained a neural codec language model, called VALL-E, on 60,000 hours of English speech, resulting in a system that can take a 3-second voice sample and generate realistic-sounding speech that closely mimics the speaker's tone and inflection.
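
To get a feel for what a "neural codec language model" works with, here is a rough back-of-the-envelope sketch of how many discrete audio tokens a 3-second sample might turn into. The frame rate (75 frames per second) and the number of codebooks (8) are assumptions based on the codec settings described in the linked paper, so check them there before relying on the numbers. The function name is made up for illustration.

```python
# Rough token-count arithmetic for a neural codec prompt.
# Assumed figures (verify against the paper): 75 codec frames per second
# of audio, and 8 codebooks (one token per codebook) for each frame.
FRAMES_PER_SECOND = 75
NUM_CODEBOOKS = 8


def codec_token_count(seconds: float) -> int:
    """Return the number of discrete codec tokens for `seconds` of audio."""
    frames = round(seconds * FRAMES_PER_SECOND)
    return frames * NUM_CODEBOOKS


# A 3-second voice sample under the assumptions above.
print(codec_token_count(3.0))  # 1800
```

Under those assumptions, the model sees the 3-second sample as a short sequence of a couple thousand tokens, which is what lets it treat voice cloning like a text-style language modeling problem.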

View the links below for many audio samples and the full paper.

https://valle-demo.github.io/
https://arxiv.org/abs/2301.02111