Earlier this week, Apple posted three new entries on their Machine Learning Journal detailing multiple aspects of how Siri has been improved over time.
The one linked above centers on how Siri’s voice has been vastly improved since iOS 9.
Starting in iOS 10 and continuing with new features in iOS 11, we base Siri voices on deep learning. The resulting voices are more natural, smoother, and allow Siri’s personality to shine through.
How speech synthesis works:
Building a high-quality text-to-speech (TTS) system for a personal assistant is not an easy task. The first phase is to find a professional voice talent whose voice is both pleasant and intelligible and fits the personality of Siri. In order to cover some of the vast variety of human speech, we first need to record 10–20 hours of speech in a professional studio. The recording scripts vary from audio books to navigation instructions, and from prompted answers to witty jokes. Typically, this natural speech cannot be used as it is recorded because it is impossible to record all possible utterances the assistant may speak.
This next figure illustrates how speech synthesis works via the selection of half-phones for each segment of the target speech:
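The half-phone selection the figure describes is classic unit selection: for each target half-phone, pick one recorded candidate so that the total of target cost (how well a candidate fits the requested unit) and concatenation cost (how smoothly adjacent picks join) is minimized, typically via a Viterbi-style dynamic program. Here is a toy sketch of that search; the candidate representation and cost functions are illustrative stand-ins, not Apple's actual implementation.

```python
# Toy unit-selection search over half-phone candidates.
# targets[i] describes the desired i-th unit; candidates[i] lists the
# recorded units usable for it. Costs are caller-supplied functions.

def select_units(targets, candidates, target_cost, concat_cost):
    """Viterbi search minimizing total target + concatenation cost."""
    # best[i][j] = (cumulative cost of ending at candidates[i][j],
    #               index of the best predecessor in candidates[i-1])
    best = [[(target_cost(targets[0], c), -1) for c in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for c in candidates[i]:
            tc = target_cost(targets[i], c)
            # cheapest way to reach c from any previous candidate
            cost, prev = min(
                (best[i - 1][k][0] + concat_cost(p, c) + tc, k)
                for k, p in enumerate(candidates[i - 1])
            )
            row.append((cost, prev))
        best.append(row)
    # backtrack from the cheapest final candidate
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    return list(reversed(path))

# Illustrative run: units are just pitch values, and both costs are
# absolute pitch differences, so the search favors smooth transitions.
targets = [100, 110, 120]
candidates = [[95, 105], [100, 115], [118, 130]]
diff = lambda a, b: abs(a - b)
print(select_units(targets, candidates, diff, diff))  # → [105, 115, 118]
```

Note how the search rejects the candidate 100 (a perfect match for the second target in isolation) because picking 105 and 115 gives cheaper joins overall; this global trade-off is exactly why a dynamic program is used rather than a greedy per-unit pick.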
On Siri’s new iOS 11 voice:
For iOS 11, we chose a new female voice talent with the goal of improving the naturalness, personality, and expressivity of Siri’s voice. We evaluated hundreds of candidates before choosing the best one. Then, we recorded over 20 hours of speech and built a new TTS voice using the new deep learning based TTS technology. As a result, the new US English Siri voice sounds better than ever. Table 1 contains a few examples of the Siri deep learning–based voices in iOS 11 and 10 compared to a traditional unit selection voice in iOS 9.
Make sure you check out the audio comparisons on the page from iOS 9 through iOS 11. After using Siri extensively on iOS 11, I can truly say the new voice is better than ever, and absolutely more natural and expressive.
Reading these journal entries makes you realize just how difficult speech synthesis and recognition really are. Maybe we can go a little easier on Siri when she doesn’t understand or perform exactly as we expect every time. To err is human, and digital assistants are becoming increasingly human-like, after all.
I’m super excited to see how Siri and other assistants improve over the next few years. I think we’re going to see the bar raised exponentially thanks to machine learning.
- Acoustic Models by Cross-bandwidth and Cross-lingual Initialization
  - Discusses how to improve speech recognition on low-bandwidth audio (e.g. Bluetooth devices).
- Inverse Text Normalization as a Labeling Problem
  - Discusses how Siri formats entities like dates, times, and addresses, and how this can be formulated as a labeling problem.
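The labeling framing in that last entry can be made concrete: each spoken-form token gets a label encoding a rewrite (keep verbatim, convert to digits, fuse with the previous token, etc.), and applying the labels yields written-form text. Here is a toy sketch of the label-application step; the label set, the tiny `DIGITS` table, and the hand-written labels stand in for what would really be a learned tagger and a full grammar, and are not Apple's actual scheme.

```python
# Toy inverse text normalization (ITN) as per-token labeling.
# Each label is (rewrite, join): rewrite says how to transform the token,
# join says how to attach it to the output built so far.

DIGITS = {"six": "6", "thirty": "30"}  # illustrative mini-lexicon

def apply_labels(tokens, labels):
    """Turn spoken-form tokens into written form by applying labels."""
    out = []
    for tok, (rewrite, join) in zip(tokens, labels):
        text = DIGITS[tok] if rewrite == "DIGIT" else tok
        if join == "APPEND" and out:
            out[-1] += text            # fuse with previous token
        elif join == "COLON" and out:
            out[-1] += ":" + text      # time separator, e.g. 6:30
        else:
            out.append(text)           # start a new written token
    return " ".join(out)

tokens = ["meet", "at", "six", "thirty"]
labels = [("VERBATIM", "SPACE"), ("VERBATIM", "SPACE"),
          ("DIGIT", "SPACE"), ("DIGIT", "COLON")]
print(apply_labels(tokens, labels))  # → "meet at 6:30"
```

The appeal of this framing is that once rewrites are expressed as a small label inventory, predicting the labels becomes an ordinary sequence-tagging problem, and the deterministic application step above stays trivial.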