Monday, April 16, 2018

Apple explains how Personalized Hey Siri works →

Apple’s latest entry in their Machine Learning Journal details how they personalized “Hey Siri,” the trigger phrase for engaging the personal assistant. Here are a few interesting tidbits.

[…] Unintended activations occur in three scenarios – 1) when the primary user says a similar phrase, 2) when other users say “Hey Siri,” and 3) when other users say a similar phrase. The last one is the most annoying false activation of all. In an effort to reduce such False Accepts (FA), our work aims to personalize each device such that it (for the most part) only wakes up when the primary user says “Hey Siri.” […]

I love the candidness of the writers here. I can also relate to the first scenario. Let’s just say I’ve learned how often I say the phrase “Are you serious?”, because about 75% of the time I do, Siri thinks I’m trying to activate her. It’s fairly annoying on multiple levels.

On Siri enrollment and learning:

[…] During explicit enrollment, a user is asked to say the target trigger phrase a few times, and the on-device speaker recognition system trains a PHS speaker profile from these utterances. This ensures that every user has a faithfully-trained PHS profile before he or she begins using the “Hey Siri” feature; thus immediately reducing IA rates. However, the recordings typically obtained during the explicit enrollment often contain very little environmental variability. […]
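
To make the explicit enrollment step concrete, here’s a rough Swift sketch of what building a speaker profile from those few enrollment utterances could look like. Everything here is my own illustration: I’m assuming each utterance has already been reduced to a fixed-length “speaker vector” by some acoustic model, and the averaging and cosine scoring are stand-ins for whatever Apple actually does.

```swift
// Hypothetical sketch: build a speaker profile from a handful of enrollment
// utterances, each already summarized as a fixed-length "speaker vector".
// The averaging and cosine scoring are illustrative, not Apple's pipeline.

typealias SpeakerVector = [Float]

struct SpeakerProfile {
    var centroid: SpeakerVector

    // Average the speaker vectors captured during the
    // "say 'Hey Siri' a few times" enrollment flow.
    init?(enrollmentVectors: [SpeakerVector]) {
        guard let dim = enrollmentVectors.first?.count else { return nil }
        var sum = [Float](repeating: 0, count: dim)
        for vector in enrollmentVectors {
            for i in 0..<dim { sum[i] += vector[i] }
        }
        centroid = sum.map { $0 / Float(enrollmentVectors.count) }
    }

    // Cosine similarity between a new utterance's vector and the profile.
    func score(_ vector: SpeakerVector) -> Float {
        var dot: Float = 0, normA: Float = 0, normB: Float = 0
        for i in 0..<centroid.count {
            dot += centroid[i] * vector[i]
            normA += centroid[i] * centroid[i]
            normB += vector[i] * vector[i]
        }
        return dot / (normA.squareRoot() * normB.squareRoot() + 1e-9)
    }
}
```

At trigger time, the device would compare the score of an incoming “Hey Siri” candidate against a threshold before waking up.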

And:

This brings to bear the notion of implicit enrollment, in which a speaker profile is created over a period of time using the utterances spoken by the primary user. Because these recordings are made in real-world situations, they have the potential to improve the robustness of our speaker profile. The danger, however, lies in the handling of imposter accepts and false alarms; if enough of these get included early on, the resulting profile will be corrupted and not faithfully represent the primary user’s voice. The device might begin to falsely reject the primary user’s voice or falsely accept other imposters’ voices (or both!) and the feature will become useless.
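
And a similarly hand-wavy sketch of the implicit side, extending the SpeakerProfile sketch above: only fold a real-world utterance into the profile when both the trigger-phrase confidence and the speaker score clear conservative thresholds, so imposter accepts and false alarms are less likely to poison it. The thresholds and the update rule are made up for illustration.

```swift
// Hypothetical gated update for implicit enrollment, extending the
// SpeakerProfile sketch above. Thresholds and the blending rule are
// illustrative only.

extension SpeakerProfile {
    mutating func maybeUpdate(with vector: SpeakerVector,
                              phraseConfidence: Float,
                              phraseThreshold: Float = 0.95,
                              speakerThreshold: Float = 0.80,
                              learningRate: Float = 0.05) {
        // Require both a confident "Hey Siri" detection and a strong speaker match;
        // otherwise skip the utterance rather than risk corrupting the profile.
        guard phraseConfidence >= phraseThreshold, score(vector) >= speakerThreshold else { return }

        // Nudge the stored profile a small step toward the accepted utterance.
        for i in 0..<centroid.count {
            centroid[i] = (1 - learningRate) * centroid[i] + learningRate * vector[i]
        }
    }
}
```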

Heh. Maybe this explains my “Are you serious?” problem.

They go on to explain improving speaker recognition, model training, and more. As with all of Apple’s Machine Learning Journal entries, this one is very technical, but these peeks behind the curtain are highly interesting, to say the least.

One thing I didn’t see mentioned was how microphone quality and quantity improve recognition. For instance, Hey Siri works spookily well on HomePod, with its seven microphones. However, I assume they aren’t using Personalized Hey Siri on HomePod, since it’s a communal device with multiple users, so the trigger success rate may already be higher to begin with. Either way, I wish my iPhone would hear me just as well.

Friday, April 13, 2018

How Google identifies who's talking →

From the Google Research Blog:

People are remarkably good at focusing their attention on a particular person in a noisy environment, mentally “muting” all other voices and sounds. Known as the cocktail party effect, this capability comes natural to us humans. However, automatic speech separation — separating an audio signal into its individual speech sources — while a well-studied problem, remains a significant challenge for computers.
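
As I read the post, Google’s model predicts a time-frequency mask for each speaker (from both the audio and the video, which I’m glossing over). In general, separation comes down to estimating such a mask and applying it to the mixture’s spectrogram. A tiny illustrative sketch, with made-up types:

```swift
// Illustrative only: apply a per-speaker time-frequency mask to a mixture
// spectrogram. The interesting work, predicting a good mask, is the learned
// model and isn't shown here.

typealias Spectrogram = [[Float]]   // [time][frequencyBin] magnitudes

func isolateSpeaker(mixture: Spectrogram, mask: Spectrogram) -> Spectrogram {
    return zip(mixture, mask).map { frame, maskFrame in
        zip(frame, maskFrame).map { bin, weight in bin * weight }
    }
}
```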

I hope Apple makes similar advances in this area. Identification by voice will open up so many possibilities.

Thursday, April 5, 2018

Apple hires John Giannandrea, Google’s AI chief →

Jack Nicas and Cade Metz for The New York Times:

Apple has hired Google’s chief of search and artificial intelligence, John Giannandrea, a major coup in its bid to catch up to the artificial intelligence technology of its rivals.

Apple said on Tuesday that Mr. Giannandrea will run Apple’s “machine learning and A.I. strategy,” and become one of 16 executives who report directly to Apple’s chief executive, Timothy D. Cook.

And:

“Our technology must be infused with the values we all hold dear,” Mr. Cook said Tuesday morning in an email to staff members obtained by The New York Times. “John shares our commitment to privacy and our thoughtful approach as we make computers even smarter and more personal.”

Now this is a huge hire. Apple must have ponied up big time, but I’m sure a guy like Giannandrea wouldn’t jump ship for money alone. They must have let on just enough about what he’ll be working on for it to be worth it.

The note about privacy highlights a stark difference between how Google and Apple handle user data. The perception of Apple as lacking in AI/ML is mostly attributable to their hard-line privacy stance, whereas Google is more cavalier with user data. It’s going to be a hard problem for Apple and Giannandrea to solve, but all good things come with time. Apple is essentially saying we can have exceptional AI/ML and keep our privacy. That’s exactly what we need.

Sunday, August 27, 2017

How Siri’s voice has improved from iOS 9 to iOS 11 →

Earlier this week, Apple posted three new entries on their Machine Learning Journal detailing multiple aspects of how Siri has been improved over time.

The one linked above centers on how Siri’s voice has been vastly improved since iOS 9.

Starting in iOS 10 and continuing with new features in iOS 11, we base Siri voices on deep learning. The resulting voices are more natural, smoother, and allow Siri’s personality to shine through.

How speech synthesis works:

Building a high-quality text-to-speech (TTS) system for a personal assistant is not an easy task. The first phase is to find a professional voice talent whose voice is both pleasant and intelligible and fits the personality of Siri. In order to cover some of the vast variety of human speech, we first need to record 10–20 hours of speech in a professional studio. The recording scripts vary from audio books to navigation instructions, and from prompted answers to witty jokes. Typically, this natural speech cannot be used as it is recorded because it is impossible to record all possible utterances the assistant may speak.

This next figure illustrates how speech synthesis works via the selection of half-phone units from the recorded speech database:

Figure 1. Illustration of unit selection speech synthesis using half-phones. The synthesized utterance “Unit selection synthesis” and its phonetic transcription using half-phones are shown at the top of the figure. The corresponding synthetic waveform and its spectrogram are shown below. The speech segments delimited by the lines are continuous speech segments from the database that may contain one or more half-phones.
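
To make the figure a little more concrete, here’s a toy sketch of the dynamic-programming search that unit selection rests on: every candidate half-phone unit carries a target cost (how well it fits the requested sound in context) and a join cost to its predecessor (how smoothly the two concatenate), and the synthesizer picks the cheapest path through the candidates. The types and cost functions below are placeholders of my own; real systems derive them from acoustic and prosodic features.

```swift
// Toy unit-selection search: choose one candidate unit per half-phone position
// so that the total of target costs plus join costs is minimal.

struct Unit {
    let id: Int
    let targetCost: Float   // how well this recorded unit fits the requested half-phone
}

// Placeholder join cost; real systems compare pitch, energy, and spectra at the boundary.
func joinCost(_ a: Unit, _ b: Unit) -> Float {
    return abs(Float(a.id - b.id)) * 0.01
}

// candidates[t] holds the database units that could realize half-phone t of the utterance.
func selectUnits(candidates: [[Unit]]) -> [Unit] {
    guard var bestCost = candidates.first?.map({ $0.targetCost }) else { return [] }
    var backPointers: [[Int]] = []

    for t in 1..<candidates.count {
        var newCost = [Float](repeating: .infinity, count: candidates[t].count)
        var pointers = [Int](repeating: 0, count: candidates[t].count)
        for (j, unit) in candidates[t].enumerated() {
            for (i, previous) in candidates[t - 1].enumerated() {
                let cost = bestCost[i] + joinCost(previous, unit) + unit.targetCost
                if cost < newCost[j] {
                    newCost[j] = cost
                    pointers[j] = i
                }
            }
        }
        bestCost = newCost
        backPointers.append(pointers)
    }

    // Trace the cheapest path back through the lattice.
    var index = bestCost.indices.min(by: { bestCost[$0] < bestCost[$1] }) ?? 0
    var path = [candidates[candidates.count - 1][index]]
    for t in stride(from: candidates.count - 2, through: 0, by: -1) {
        index = backPointers[t][index]
        path.append(candidates[t][index])
    }
    return Array(path.reversed())
}
```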

On Siri’s new iOS 11 voice:

For iOS 11, we chose a new female voice talent with the goal of improving the naturalness, personality, and expressivity of Siri’s voice. We evaluated hundreds of candidates before choosing the best one. Then, we recorded over 20 hours of speech and built a new TTS voice using the new deep learning based TTS technology. As a result, the new US English Siri voice sounds better than ever. Table 1 contains a few examples of the Siri deep learning-based voices in iOS 11 and 10 compared to a traditional unit selection voice in iOS 9.

Make sure you check out the audio comparisons on the page, from iOS 9 through iOS 11. After using Siri extensively on iOS 11, I can truly say the new voice is better than ever, and absolutely more natural and expressive.

Reading these journal entries makes you realize how difficult speech synthesis and recognition really are. Maybe we can go a little easier on Siri when she doesn’t understand or perform exactly how we expect every time. To err is human, and digital assistants are becoming increasingly human-like, after all.

I’m super excited to see how Siri and other assistants improve over the next few years. I think we’re going to see the bar raised exponentially thanks to machine learning.

Tuesday, August 22, 2017

Apple’s Core ML Brings AI to the Masses →

Gene Munster for Loup Ventures:

In June Apple announced Core ML, a platform that allows app developers to easily integrate machine learning (ML) into an app. Of the estimated 2.4m apps available on the App Store, we believe less than 1% leverage ML today – but not for long. We believe Core ML will be a driving force in bringing machine learning to the masses in the form of more useful and insightful apps that run faster and respect user privacy.

While not a complete list, here are some of the areas where Apple already uses AI:

  • Facial recognition in photos
  • Next word prediction on the iOS keyboard
  • Smart responses on the Apple Watch
  • Handwriting interpretation on the Apple Watch
  • Chinese handwriting recognition
  • Drawing based on pencil pressure on the iPad
  • Extending iPhone battery life by modifying when data is refreshed (hard to imagine, but our iPhone batteries would be even worse if not for AI)

On Machine Learning differences between Apple and Android:

  • Speed. ML on Apple is processed locally which speeds up the app. Typically, Android apps process ML in the cloud. Apple can process ML locally because app developers can easily test the hardware running the app (iOS devices). In an Android world, hardware fragmentation makes it harder for app developers to run ML locally.
  • Availability. Core ML powered apps are always available, even without network connectivity. Android ML powered apps can require network connectivity, which limits their usability.
  • Privacy. Apple’s privacy values are woven into Core ML; terms and conditions do not allow Apple to see any user data captured by an app. For example, if you take a picture using an app that is powered by Core ML’s vision, Apple won’t see the photo. If a message is read using an app powered by Core ML’s natural language processor, the contents won’t be sent to Apple. This differs from Android apps, which typically share their data with Google as part of their terms and conditions.
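
For a sense of how little code on-device inference takes, here’s a minimal sketch using Core ML through the Vision framework. The “FlowerClassifier” model name is a hypothetical stand-in for whatever .mlmodel file a developer drags into an Xcode project (Xcode generates a Swift class with that name); everything runs locally on the device, which is exactly the speed, availability, and privacy point above.

```swift
import CoreML
import UIKit
import Vision

// "FlowerClassifier" is a hypothetical .mlmodel bundled with the app;
// Xcode generates the FlowerClassifier class from it automatically.
func classify(_ image: UIImage) {
    guard let cgImage = image.cgImage,
          let visionModel = try? VNCoreMLModel(for: FlowerClassifier().model) else { return }

    let request = VNCoreMLRequest(model: visionModel) { request, _ in
        // Inference happens on the device; the photo never leaves it.
        guard let best = (request.results as? [VNClassificationObservation])?.first else { return }
        print("Top label: \(best.identifier), confidence: \(best.confidence)")
    }

    let handler = VNImageRequestHandler(cgImage: cgImage, options: [:])
    try? handler.perform([request])
}
```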

Excellent overview of Core ML by Gene. As I said in my post on Apple’s Machine Learning Journal, Apple is investing heavily in Machine Learning. While their stance as the privacy-first tech company may slow them at times, they easily make up for it in customer satisfaction and, in the end, adoption by developers. Apple is perfectly fine being the tortoise (as opposed to the hare) when it comes to solving problems the right way.

Machine Learning and AR are going to be transformative for technology and how it affects our daily lives.

Wednesday, July 19, 2017

Apple’s Machine Learning Journal →

Today, Apple announced a new journal (read: blog) to catalog their machine learning findings.

Welcome to the Apple Machine Learning Journal. Here, you can read posts written by Apple engineers about their work using machine learning technologies to help build innovative products for millions of people around the world. If you’re a machine learning researcher or student, an engineer or developer, we’d love to hear your questions and feedback. Write us at [email protected]

In the first entry, they discuss refining synthetic images to look more realistic, as a cheaper alternative to collecting training sets of real images that are large, diverse, and accurately annotated.

Most successful examples of neural nets today are trained with supervision. However, to achieve high accuracy, the training sets need to be large, diverse, and accurately annotated, which is costly. An alternative to labelling huge amounts of data is to use synthetic images from a simulator. This is cheap as there is no labeling cost, but the synthetic images may not be realistic enough, resulting in poor generalization on real test images. To help close this performance gap, we’ve developed a method for refining synthetic images to make them look more realistic. We show that training models on these refined images leads to significant improvements in accuracy on various machine learning tasks.

They go on to explain the challenges and methods used to refine synthetic images, demonstrated by the example figure below.

Figure 1. The task is to learn a model that improves the realism of synthetic images from a simulator using unlabeled real data, while preserving the annotation information.
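
The key tension the figure hints at is that the refiner has to make a synthetic image look more realistic without invalidating the annotations that came free from the simulator. A toy way to picture that trade-off as a loss function (the realism term would come from a learned discriminator, and the L1 penalty and weighting here are my own illustration of the idea, not Apple’s exact objective):

```swift
// Illustrative only: the refiner's objective balances "look real" against
// "stay close to the original synthetic image so its labels stay valid".

typealias Image = [Float]   // flattened pixel values, for brevity

func refinerLoss(refined: Image,
                 synthetic: Image,
                 realismLoss: Float,     // e.g. from a discriminator trained to spot fakes
                 lambda: Float = 0.1) -> Float {
    // L1 distance anchors the refined image to the labeled synthetic input.
    var selfRegularization: Float = 0
    for i in 0..<refined.count {
        selfRegularization += abs(refined[i] - synthetic[i])
    }
    return realismLoss + lambda * selfRegularization
}
```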

The post is fascinating. Augmented reality and machine learning are the next frontier for computing, and a growing focus for Apple. This is demonstrated by iOS 11’s ARKit and Core ML, which allow developers to easily integrate these technologies into their apps. In a recent interview with Bloomberg, Tim Cook talked about autonomous systems and Apple’s focus on them, including software for self-driving cars, calling it “the mother of all AI projects”.

Some are worried Apple is limiting themselves in these areas because of their privacy and security stances. It’s a self-imposed limitation, yes, but that could be why they are being more open about publishing their findings in this space: to attract like-minded individuals who share the same passion and belief system. For example, all machine learning features on iOS right now run on-device. No identifiable data is sent back to iCloud and analyzed by a supercomputer to suggest similar faces in the Photos app, for instance; it’s all done by your iPhone or iPad. Mark Gurman even reported back in May that Apple is developing an ‘AI’ chip to specifically handle these tasks, similar to how the motion co-processor handles all motion data. Makes total sense.

I would much rather have the comfort of knowing my device is doing all the work, even if it comes at the cost of speed to market. Besides, it’s inevitable that our machines will do more for us on their own. Apple may take a little more time to get there, but that’s their M.O.: the iPhone wasn’t the first smartphone and the Apple Watch wasn’t the first smartwatch, yet both are now the benchmark in their markets. Apple will do this right, as opposed to other companies who live on getting their hands on your data, and it will set the benchmark for machine learning privacy.

Apple is indeed a secretive company, but under Tim Cook’s direction we are seeing them embrace more openness. One earlier example is the open-sourcing of Swift. It makes me excited to see what will come next as a result of this openness.