How Siri’s voice has improved from iOS 9 to iOS 11 →

Earlier this week, Apple posted three new entries on their Machine Learning Journal detailing multiple aspects of how Siri has been improved over time.

The one linked above centers on how Siri’s voice has been vastly improved since iOS 9.

Starting in iOS 10 and continuing with new features in iOS 11, we base Siri voices on deep learning. The resulting voices are more natural, smoother, and allow Siri’s personality to shine through.

How speech synthesis works:

Building a high-quality text-to-speech (TTS) system for a personal assistant is not an easy task. The first phase is to find a professional voice talent whose voice is both pleasant and intelligible and fits the personality of Siri. In order to cover some of the vast variety of human speech, we first need to record 10–20 hours of speech in a professional studio. The recording scripts vary from audio books to navigation instructions, and from prompted answers to witty jokes. Typically, this natural speech cannot be used as it is recorded because it is impossible to record all possible utterances the assistant may speak.

This next figure illustrates how speech synthesis works via the selection of half-phone units for each piece of the target utterance’s phonetic transcription:


Figure 1. Illustration of unit selection speech synthesis using half-phones. The synthesized utterance “Unit selection synthesis” and its phonetic transcription using half-phones are shown at the top of the figure. The corresponding synthetic waveform and its spectrogram are shown below. The speech segments delimited by the lines are continuous speech segments from the database that may contain one or more half-phones.
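The figure only hints at the search behind unit selection, so here is a minimal Swift sketch of the idea (my own toy illustration, not Apple’s implementation): each target half-phone has several recorded candidates, and a Viterbi-style pass picks the sequence that minimizes target cost (how well a unit matches its target) plus concatenation cost (how smoothly adjacent units join). The structs, cost functions, and numbers are all made up; the real system uses far richer acoustic and linguistic features, and the deep learning introduced in iOS 10 and 11 is used to guide this selection.

```swift
import Foundation

// A toy half-phone unit from the recorded speech database. Real systems store
// rich acoustic and linguistic features; pitch and duration are just for illustration.
struct Unit {
    let halfPhone: String
    let pitch: Double
    let duration: Double
}

// The target specification for one half-phone of the utterance to synthesize.
struct Target {
    let halfPhone: String
    let pitch: Double
    let duration: Double
}

// How well a database unit matches its target (lower is better).
func targetCost(_ t: Target, _ u: Unit) -> Double {
    return abs(t.pitch - u.pitch) + 10 * abs(t.duration - u.duration)
}

// How smoothly two consecutive units join at the concatenation point.
func concatenationCost(_ a: Unit, _ b: Unit) -> Double {
    return abs(a.pitch - b.pitch)
}

// Viterbi-style search: pick one candidate unit per target half-phone so the
// total of target costs plus concatenation costs is minimal.
func selectUnits(targets: [Target], candidates: [[Unit]]) -> [Unit] {
    precondition(targets.count == candidates.count && !targets.isEmpty)

    // best[j] = lowest cost of any path ending at candidate j of the current target
    var best = candidates[0].map { targetCost(targets[0], $0) }
    var backPointers: [[Int]] = []

    for i in 1..<targets.count {
        var nextBest = [Double](repeating: .infinity, count: candidates[i].count)
        var pointers = [Int](repeating: 0, count: candidates[i].count)
        for (j, unit) in candidates[i].enumerated() {
            for (k, previous) in candidates[i - 1].enumerated() {
                let cost = best[k] + concatenationCost(previous, unit) + targetCost(targets[i], unit)
                if cost < nextBest[j] {
                    nextBest[j] = cost
                    pointers[j] = k
                }
            }
        }
        best = nextBest
        backPointers.append(pointers)
    }

    // Trace back the cheapest path from the last target to the first.
    var index = best.indices.min(by: { best[$0] < best[$1] })!
    var selected = [candidates[targets.count - 1][index]]
    for i in stride(from: targets.count - 1, through: 1, by: -1) {
        index = backPointers[i - 1][index]
        selected.append(candidates[i - 1][index])
    }
    return Array(selected.reversed())
}

// Toy example: two half-phone targets with two recorded candidates each.
let targets = [
    Target(halfPhone: "y_left", pitch: 120, duration: 0.05),
    Target(halfPhone: "y_right", pitch: 118, duration: 0.04)
]
let candidates = [
    [Unit(halfPhone: "y_left", pitch: 119, duration: 0.05),
     Unit(halfPhone: "y_left", pitch: 140, duration: 0.07)],
    [Unit(halfPhone: "y_right", pitch: 121, duration: 0.04),
     Unit(halfPhone: "y_right", pitch: 90, duration: 0.03)]
]
let chosen = selectUnits(targets: targets, candidates: candidates)
print(chosen.map { "\($0.halfPhone)@\($0.pitch)Hz" })
```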

On Siri’s new iOS 11 voice:

For iOS 11, we chose a new female voice talent with the goal of improving the naturalness, personality, and expressivity of Siri’s voice. We evaluated hundreds of candidates before choosing the best one. Then, we recorded over 20 hours of speech and built a new TTS voice using the new deep learning based TTS technology. As a result, the new US English Siri voice sounds better than ever. Table 1 contains a few examples of the Siri deep learning-based voices in iOS 11 and 10 compared to a traditional unit selection voice in iOS 9.

Make sure you check out the audio comparisons on the page from iOS 9 through iOS 11. After using Siri extensively on iOS 11, I can truly say the new voice is better than ever, and absolutely more natural and expressive.

Reading these journal entries makes you realize just how difficult speech recognition and synthesis really are. Maybe we can go a little easier on Siri when she doesn’t understand or perform exactly as we expect every time. To err is human, and digital assistants are becoming increasingly human-like, after all.

I’m super excited to see how Siri and other assistants improve over the next few years. I think we’re going to see the bar raised exponentially thanks to machine learning.

Other Posts

Apple’s Core ML Brings AI to the Masses →

Gene Munster for Loup Ventures:

In June Apple announced Core ML, a platform that allows app developers to easily integrate machine learning (ML) into an app. Of the estimated 2.4m apps available on the App Store, we believe less than 1% leverage ML today – but not for long. We believe Core ML will be a driving force in bringing machine learning to the masses in the form of more useful and insightful apps that run faster and respect user privacy.

While not a complete list, Apple has since used AI in the following areas:

  • Facial recognition in photos
  • Next word prediction on the iOS keyboard
  • Smart responses on the Apple Watch
  • Handwriting interpretation on the Apple Watch
  • Chinese handwriting recognition
  • Drawing based on pencil pressure on the iPad
  • Extending iPhone battery life by modifying when data is refreshed (hard to imagine that our iPhone batteries would be even worse if not for AI)

On Machine Learning differences between Apple and Android:

  • Speed. ML on Apple is processed locally which speeds up the app. Typically, Android apps process ML in the cloud. Apple can process ML locally because app developers can easily test the hardware running the app (iOS devices). In an Android world, hardware fragmentation makes it harder for app developers to run ML locally.
  • Availability. Core ML powered apps are always available, even without network connectivity. Android ML powered apps can require network connectivity, which limits their usability.
  • Privacy. Apple’s privacy values are woven into Core ML; terms and conditions do not allow Apple to see any user data captured by an app. For example, if you take a picture using an app that is powered by Core ML’s vision, Apple won’t see the photo. If a message is read using an app powered by Core ML’s natural language processor, the contents won’t be sent to Apple. This differs from Android apps, which typically share their data with Google as part of their terms and conditions.
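To make “easily integrate” concrete, here is a minimal Swift sketch of on-device image classification using Core ML together with the Vision framework. `MobileNet` is a stand-in for whatever .mlmodel file you have added to your Xcode project (Xcode generates that class from the model file), and error handling is stripped down for brevity:

```swift
import CoreML
import UIKit
import Vision

// A minimal sketch of on-device image classification with Core ML and Vision.
// "MobileNet" is a placeholder for any .mlmodel added to the Xcode project;
// Xcode generates the MobileNet class from the model file.
func classify(_ image: UIImage) {
    guard let cgImage = image.cgImage,
          let visionModel = try? VNCoreMLModel(for: MobileNet().model) else { return }

    let request = VNCoreMLRequest(model: visionModel) { request, _ in
        // Print the top classification, e.g. "golden retriever (0.87)".
        if let best = (request.results as? [VNClassificationObservation])?.first {
            print("\(best.identifier) (\(best.confidence))")
        }
    }

    // Inference runs entirely on the device; the image never leaves it.
    let handler = VNImageRequestHandler(cgImage: cgImage, options: [:])
    try? handler.perform([request])
}
```

Everything in that snippet runs locally on the device, which is exactly where the speed, availability, and privacy advantages above come from.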

Excellent overview of Core ML by Gene. As I said in my post on Apple’s Machine Learning Journal, Apple is investing heavily in machine learning. While their stance as the privacy-focused tech company may slow them at times, they easily make up for it in customer satisfaction and, in the long run, adoption by developers. Apple is perfectly fine being the tortoise (as opposed to the hare) when it comes to solving problems the right way.

Machine Learning and AR are going to be transformative for technology and how it affects our daily lives.

Apple’s Machine Learning Journal →

Today, Apple announced a new journal (read: blog) to catalog their machine learning findings.

Welcome to the Apple Machine Learning Journal. Here, you can read posts written by Apple engineers about their work using machine learning technologies to help build innovative products for millions of people around the world. If you’re a machine learning researcher or student, an engineer or developer, we’d love to hear your questions and feedback. Write us at [email protected]

In the first entry, they discuss improving the realism of synthetic images so they can stand in for the large, diverse, and accurately annotated training sets that supervised learning normally requires.

Most successful examples of neural nets today are trained with supervision. However, to achieve high accuracy, the training sets need to be large, diverse, and accurately annotated, which is costly. An alternative to labelling huge amounts of data is to use synthetic images from a simulator. This is cheap as there is no labeling cost, but the synthetic images may not be realistic enough, resulting in poor generalization on real test images. To help close this performance gap, we’ve developed a method for refining synthetic images to make them look more realistic. We show that training models on these refined images leads to significant improvements in accuracy on various machine learning tasks.

They go on to explain the challenges and methods involved in refining synthetic images, demonstrated by the example figure below.


Figure 1. The task is to learn a model that improves the realism of synthetic images from a simulator using unlabeled real data, while preserving the annotation information.
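As a rough sketch of the idea (my own toy illustration, not Apple’s actual networks): the refiner is trained to fool a discriminator into scoring its output as “real,” while a self-regularization term keeps the refined image close to the original synthetic one so its annotations remain valid. In Swift, that combined objective might look something like this, with images reduced to plain arrays and a placeholder discriminator:

```swift
import Foundation

// Toy version of the refiner objective described in the post: an adversarial
// "realism" term pushes refined images toward what a discriminator judges as
// real, while a self-regularization term keeps each refined image close to its
// synthetic original so the annotations stay valid. All values are made up.

// Placeholder for a learned discriminator: probability that an image is real.
func discriminatorRealProbability(_ image: [Double]) -> Double {
    return 0.3
}

func refinerLoss(synthetic: [Double], refined: [Double], lambda: Double) -> Double {
    // Realism term: small when the discriminator is fooled into saying "real".
    let realism = -log(max(discriminatorRealProbability(refined), 1e-9))

    // Self-regularization term: L1 distance between refined and synthetic image.
    let regularization = zip(refined, synthetic).reduce(0.0) { $0 + abs($1.0 - $1.1) }

    return realism + lambda * regularization
}

// Example: a refined image that drifted slightly from its synthetic source.
let syntheticImage = [0.10, 0.50, 0.90]
let refinedImage = [0.12, 0.48, 0.95]
print(refinerLoss(synthetic: syntheticImage, refined: refinedImage, lambda: 0.1))
```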

The post is fascinating. Augmented reality and machine learning are the next frontier for computing, and a growing focus for Apple. This is demonstrated by iOS 11’s ARKit and Core ML, which allow developers to easily bring these technologies into their apps. In a recent interview with Bloomberg, Tim Cook talked about autonomous systems and Apple’s focus on them, including software for self-driving cars, calling it “the mother of all AI projects”.

Some are worried Apple is limiting themselves in these areas because of their privacy and security stance. It’s a self-imposed limitation, yes, but that could be why they are being more open about publishing their findings in this space: to attract like-minded individuals who share the same passion and beliefs. For example, all machine learning features on iOS right now are done on-device. No identifiable data is sent back to iCloud and analyzed by a supercomputer to suggest similar faces in the Photos app; it’s all done by your iPhone or iPad. Mark Gurman even reported back in May that Apple is developing an ‘AI’ chip specifically to handle these tasks, similar to how the motion co-processor handles all motion data. Makes total sense.

I would much rather have the comfort of knowing my device is doing all the work, even if it comes at a cost of speed to market. Besides, it’s inevitable that our machines will do more for us on their own. Apple may take a little more time to get there, but that’s their M.O.: the iPhone wasn’t the first smartphone and the Apple Watch wasn’t the first smartwatch, yet both products are now the benchmark in their markets. Apple will do this right, as opposed to other companies that live on getting their hands on your data, and it will set the benchmark for machine learning privacy.

Apple is indeed a secretive company, but under Tim Cook’s direction we are seeing them embrace being more open. One prior example is the open-sourcing of Swift. It makes me excited to see what comes next as a result of this openness.