So one of our readers asked us this question.
How do Google Now, Siri, and Cortana recognize the words a person is saying?
In the 1970s, Carnegie Mellon University built a system called Harpy, which used trees of syllables to help computers understand that toe-mah-toe and toe-may-toe are both pronunciations of the same word, tomato. Harpy was a real breakthrough because earlier systems could only understand numbers spoken by a single person.
Nowadays, most systems use the Hidden Markov Model, which takes a more mathematical approach to analyzing what you said. Google Now (not sure about Siri or Cortana) splits your sentence into its separate phonemes (the individual sounds you pronounced). As you speak, it sends the audio to a server, which analyzes those phonemes and returns the word it thinks you just spoke. Unfortunately, accents can alter the pronunciation of words, so the system has to learn how to understand your voice over time. Google has a nice 7 minute documentary about this called “Behind the Mic: The Science of Talking with Computers”: https://www.youtube.com/watch?v=yxxRAHVtafI
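To make the Hidden Markov Model idea a little more concrete, here is a minimal sketch in Python: the hidden states stand for phonemes, the observations stand for coarse acoustic features from slices of audio, and a Viterbi search recovers the most likely phoneme path. Every state, observation, and probability below is invented for illustration; real recognizers learn these values from enormous amounts of recorded speech.

```python
# Toy HMM illustration: hidden states are phonemes, observations are crude
# acoustic labels for audio slices, and Viterbi finds the best phoneme path.
# All numbers are made up for illustration only.

def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable phoneme path for the observations."""
    # V[t][state] = best probability of any path ending in `state` at time t
    V = [{s: start_p[s] * emit_p[s][observations[0]] for s in states}]
    path = {s: [s] for s in states}

    for t in range(1, len(observations)):
        V.append({})
        new_path = {}
        for s in states:
            # Pick the previous state that best explains a transition into s
            prob, prev = max(
                (V[t - 1][p] * trans_p[p][s] * emit_p[s][observations[t]], p)
                for p in states
            )
            V[t][s] = prob
            new_path[s] = path[prev] + [s]
        path = new_path

    best_prob, best_state = max((V[-1][s], s) for s in states)
    return best_prob, path[best_state]

# Two phoneme "states" and two coarse acoustic observations -- purely invented.
states = ("T", "OW")
start_p = {"T": 0.6, "OW": 0.4}
trans_p = {"T": {"T": 0.3, "OW": 0.7}, "OW": {"T": 0.4, "OW": 0.6}}
emit_p = {"T": {"burst": 0.8, "voiced": 0.2}, "OW": {"burst": 0.1, "voiced": 0.9}}

print(viterbi(["burst", "voiced", "voiced"], states, start_p, trans_p, emit_p))
# -> most likely phoneme path, here ['T', 'OW', 'OW'] ("toe")
```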
Most services like this use personalized training. Google's voice services do: they train themselves on your voice over time instead of requiring you to complete a specific training exercise, because users don't want to sit through one. But users also let other people use their voice activation, which sabotages that implicit training, so the services must still rely on more generic voice models that work for everyone.
This leaves some people, called "goats" by speech recognition researchers, unable to use speech recognition very well. They may sound perfectly intelligible to you and me, but computers just can't understand them because their voices fall outside the mathematical model.
Much like predictive typing, these services use language models to fit the recognized sounds to the words you're most likely saying. They guess you meant “through the looking glass” rather than “threw the looking ass”, for instance.
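To see how that guess can be made, here is a toy sketch: a tiny add-one-smoothed bigram language model scores two acoustically similar candidate transcriptions and keeps the one that looks most like real English. All the counts are invented for illustration; real systems estimate them from billions of words of text.

```python
# Toy bigram language model: pick the candidate transcription whose word
# sequence is most probable. Counts are invented for illustration.

from collections import defaultdict

# Pretend corpus counts: how often word B followed word A.
bigram_counts = {
    ("through", "the"): 900, ("threw", "the"): 150,
    ("the", "looking"): 300, ("looking", "glass"): 250,
    ("looking", "ass"): 1,
}
unigram_counts = defaultdict(int, {
    "through": 1000, "threw": 400, "the": 50000,
    "looking": 600, "glass": 800, "ass": 300,
})

def sentence_score(words, smoothing=1.0, vocab_size=10000):
    """Very rough add-one smoothed bigram probability of a word sequence."""
    score = 1.0
    for prev, cur in zip(words, words[1:]):
        count = bigram_counts.get((prev, cur), 0)
        score *= (count + smoothing) / (unigram_counts[prev] + smoothing * vocab_size)
    return score

candidates = [
    "through the looking glass".split(),
    "threw the looking ass".split(),
]
best = max(candidates, key=sentence_score)
print(" ".join(best))   # -> "through the looking glass"
```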
This is one reason these services handle mixed accents, languages, and idioms very, very badly. An Indian man with a thick accent throwing “et voilà” into the middle of an English sentence on a phone that thinks it's en-US? Not going to work.
Google also uses your email and other interactions with Google to weight the predictions, and it leans on that even more than it learns your speech patterns.
Baidu has now ditched some of the speech recognition techniques we have mentioned here. They instead rely on an artificial neural network that they call Deep Speech.
This is an overview of the processing (a rough code sketch follows the two lists below):
- Generate a spectrogram of the speech (this gives the strength of different frequencies over time).
- Give the spectrogram to the Deep Speech model.
- The Deep Speech model reads the spectrogram one time slice at a time.
- Information about that slice of time is transformed into some learned internal representation.
- That internal representation is passed into layers of the network that have a form of memory (this lets Deep Speech use both earlier and later sound segments to inform its decisions).
- This new internal representation is used by the final layers to predict the letter that occurred in that slice of time.
A little more simply:
- Put a spectrogram of the speech into Deep Speech.
- Deep Speech gives back the probability of each letter at each point in time.
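For a feel of what that pipeline looks like in code, here is a rough, self-contained sketch (not Baidu's actual implementation): build a spectrogram with a short-time Fourier transform, feed it one time slice at a time through a small recurrent layer that carries a memory of earlier slices, and emit letter probabilities at every time step. The real Deep Speech network is far larger and also looks at later slices (it runs in both directions); the random weights here mean the output is meaningless, but the shape of the computation is the same.

```python
# Rough sketch of a Deep Speech-style pipeline:
# audio -> spectrogram -> recurrent layer with memory -> letter probabilities.
# Weights are random, so predictions are meaningless; this only shows the shape
# of the computation.

import numpy as np

ALPHABET = list("abcdefghijklmnopqrstuvwxyz '")   # letters + space + apostrophe

def spectrogram(audio, frame_len=256, hop=128):
    """Strength of each frequency over time, via a short-time Fourier transform."""
    frames = [audio[i:i + frame_len]
              for i in range(0, len(audio) - frame_len, hop)]
    return np.array([np.abs(np.fft.rfft(f * np.hanning(frame_len))) for f in frames])

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

class TinyRecurrentModel:
    """One recurrent layer (the 'memory') feeding a letter classifier."""
    def __init__(self, n_freqs, hidden=64, rng=np.random.default_rng(0)):
        self.W_in = rng.normal(0, 0.1, (hidden, n_freqs))    # slice -> hidden
        self.W_rec = rng.normal(0, 0.1, (hidden, hidden))     # memory of past slices
        self.W_out = rng.normal(0, 0.1, (len(ALPHABET), hidden))

    def predict(self, spec):
        h = np.zeros(self.W_rec.shape[0])
        letter_probs = []
        for frame in spec:                        # read one time slice at a time
            h = np.tanh(self.W_in @ frame + self.W_rec @ h)
            letter_probs.append(softmax(self.W_out @ h))
        return np.array(letter_probs)             # shape: (time steps, letters)

audio = np.random.randn(16000)                    # stand-in for 1 s of 16 kHz audio
spec = spectrogram(audio)
model = TinyRecurrentModel(n_freqs=spec.shape[1])
probs = model.predict(spec)
print(probs.shape)   # (number of time slices, number of letters)
```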