Speech synthesis is the artificial production of human speech. A computer system used for this purpose is called a speech computer or speech synthesizer, and can be implemented in software or hardware products.
Today the technique is often based on machine learning. It is used in text-to-speech, music generation, speech-enabled devices, navigation systems, and accessibility tools for visually impaired people.
Facebook’s voice synthesis AI generates speech in 500 milliseconds
Facebook today unveiled a highly efficient, AI text-to-speech (TTS) system that can be hosted in real time using regular processors. It’s currently powering Portal, the company’s brand of smart displays, and it’s available as a service for other apps, like VR, internally at Facebook.
In tandem with a new data collection approach, which leverages a language model for curation, Facebook says the system — which produces a second of audio in 500 milliseconds — enabled it to create a British-accented voice in six months as opposed to over a year for previous voices.
Most modern AI TTS systems require graphics cards, field-programmable gate arrays (FPGAs), or custom-designed AI chips like Google’s tensor processing units (TPUs) for training, for running, or for both. For instance, a recently detailed Google AI system was trained across 32 TPUs in parallel. Synthesizing a single second of humanlike audio can require outputting 24,000 samples or more. This can be expensive: Google’s latest-generation TPUs cost between $2.40 and $8 per hour on Google Cloud Platform.
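To put the two figures quoted above side by side (one second of audio in 500 milliseconds, and 24,000 samples per second of audio), here is a rough back-of-the-envelope sketch. The function names are illustrative, not part of Facebook's system, and the calculation assumes the 24,000-sample figure applies to the same audio the 500-millisecond figure describes, which the article does not state explicitly.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Wall-clock synthesis time divided by the duration of audio produced.

    Values below 1.0 mean the system generates audio faster than it plays back.
    """
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def samples_per_wall_clock_second(sample_rate_hz: int, rtf: float) -> float:
    """Samples that must be emitted per second of compute time at a given RTF."""
    if rtf <= 0:
        raise ValueError("rtf must be positive")
    return sample_rate_hz / rtf


# Figures taken from the article; pairing them is an assumption.
rtf = real_time_factor(synthesis_seconds=0.5, audio_seconds=1.0)
throughput = samples_per_wall_clock_second(sample_rate_hz=24_000, rtf=rtf)

print(f"real-time factor: {rtf:.2f}")
print(f"samples per wall-clock second: {throughput:,.0f}")
```

Under these assumptions the real-time factor is 0.5, so the processor would need to emit roughly 48,000 samples for every second of compute time.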
“The system … will play an important role in creating and scaling new voice applications that sound more human and expressive,” the company said in a statement. “We’re excited to provide higher-quality audio … so that we can more efficiently continue to bring voice interactions to everyone in our community.”
by Kyle Wiggers