Monday, April 16, 2018

Apple explains how Personalized Hey Siri works →

Apple’s latest entry in their Machine Learning Journal details how “Hey Siri,” the trigger phrase for engaging the personal assistant, is personalized to the primary user’s voice. Here are a few interesting tidbits.

[…] Unintended activations occur in three scenarios – 1) when the primary user says a similar phrase, 2) when other users say “Hey Siri,” and 3) when other users say a similar phrase. The last one is the most annoying false activation of all. In an effort to reduce such False Accepts (FA), our work aims to personalize each device such that it (for the most part) only wakes up when the primary user says “Hey Siri.” […]

I love the candidness of the writers here. I can also relate to the first scenario. Let’s just say I’ve learned how often I say the phrase “Are you serious?”, because about 75% of the time I say it, Siri thinks I’m trying to activate her. It’s fairly annoying on multiple levels.

On Siri enrollment and learning:

[…] During explicit enrollment, a user is asked to say the target trigger phrase a few times, and the on-device speaker recognition system trains a PHS speaker profile from these utterances. This ensures that every user has a faithfully-trained PHS profile before he or she begins using the “Hey Siri” feature; thus immediately reducing IA rates. However, the recordings typically obtained during the explicit enrollment often contain very little environmental variability. […]
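
To make that concrete, here’s a rough sketch of what explicit enrollment could look like, assuming (as the journal entry describes) that each utterance is reduced to a fixed-length “speaker vector.” None of this is Apple’s actual code; the cosine scoring and the threshold value are stand-ins of my own.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two speaker vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

class SpeakerProfile:
    """Toy personalized-trigger profile built from enrollment utterances."""

    def __init__(self, threshold=0.7):
        # threshold is a made-up value, not anything Apple publishes
        self.vectors = []          # speaker vectors from enrolled utterances
        self.threshold = threshold

    def enroll(self, utterance_vector):
        """Explicit enrollment: fold one prompted utterance into the profile."""
        self.vectors.append(np.asarray(utterance_vector, dtype=float))

    def score(self, utterance_vector):
        """Average cosine similarity of a new utterance against the profile."""
        v = np.asarray(utterance_vector, dtype=float)
        return sum(cosine(v, ref) for ref in self.vectors) / len(self.vectors)

    def accepts(self, utterance_vector):
        """Wake only if the utterance sounds enough like the primary user."""
        return self.score(utterance_vector) >= self.threshold
```

In this sketch, each of the prompted “Hey Siri” recordings from setup would pass through enroll(); afterward, accepts() decides whether a detected trigger phrase actually wakes the device.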

The entry continues:

This brings to bear the notion of implicit enrollment, in which a speaker profile is created over a period of time using the utterances spoken by the primary user. Because these recordings are made in real-world situations, they have the potential to improve the robustness of our speaker profile. The danger, however, lies in the handling of imposter accepts and false alarms; if enough of these get included early on, the resulting profile will be corrupted and not faithfully represent the primary users’ voice. The device might begin to falsely reject the primary user’s voice or falsely accept other imposters’ voices (or both!) and the feature will become useless.
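
Continuing the sketch above, implicit enrollment might look like a guarded update: only fold a real-world wake-up back into the profile when its score clears the wake threshold by some margin, so a lucky imposter accept doesn’t slowly poison the profile. The margin here is a made-up safety buffer, not anything Apple specifies.

```python
def maybe_enroll_implicitly(profile, utterance_vector, margin=0.15):
    """Implicit enrollment: grow the profile from real-world wake-ups,
    but only when the score clears the wake threshold by a comfortable
    margin, so imposter accepts and false alarms don't corrupt it.
    (The margin value is illustrative only.)"""
    s = profile.score(utterance_vector)
    if s >= profile.threshold + margin:
        profile.enroll(utterance_vector)
    return s
```

That margin is exactly the trade-off Apple is describing: too lax and imposters creep into the profile; too strict and the profile never picks up the real-world variability that implicit enrollment is supposed to provide.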

Heh. Maybe this explains my “Are you serious?” problem.

They go on to explain improving speaker recognition, model training, and more. As with all of Apple’s Machine Learning Journal entries, this one is quite technical, but these peeks behind the curtain are fascinating, to say the least.

One thing I didn’t see mentioned was how microphone quality and quantity improve recognition. For instance, Hey Siri works spookily well on HomePod, with its seven microphones. However, I assume they aren’t using Personalized Hey Siri on HomePod, since it’s a communal device with multiple users, so the success rate may already be inherently higher. Either way, I wish my iPhone would hear me just as well.