How does Speech Recognition work, and how can it help us teach English? (Part 1)

Automatic Speech Recognition (ASR) seems to be everywhere these days, from your smart fridge to your smartphone and every device in between. But how does it actually work, and how can it be utilized by teachers of English?

In the first part of this blog post, we learn how speech is transformed from vibrations in the air to text on your screen. In the second part (coming soon!), we take a look at some of the ways speech recognition can be used as a teaching and testing tool in English language pedagogy.

Step 1. Analog to digital

Humans live in an analog world. When we speak to each other, we don’t transmit streams of numbers; we vibrate our vocal cords, which create sound waves that vibrate other people’s eardrums, which in turn send electrical signals to the brain, which the brain interprets as words. Unfortunately, computers can’t process sound waves without first converting them into a digital form, i.e. a stream of numbers.

This is where a microphone comes in. A microphone converts vibrations in the air into an electrical signal, and an analog-to-digital converter (ADC), typically built into your device’s sound hardware, turns that signal into numbers. However, that is all this hardware can do. It can convert an analog audio wave into a digital stream of numbers, but it has no idea what words (or other sounds) those numbers represent.
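
To see what this stream of numbers actually looks like, here is a minimal Python sketch that reads a digitized recording using only the standard library. The filename hello.wav is hypothetical, and the snippet assumes a 16-bit mono WAV file:

```python
import struct
import wave

# Read a digitized recording. "hello.wav" is a hypothetical filename;
# we assume a 16-bit mono WAV file here.
with wave.open("hello.wav", "rb") as wav:
    sample_rate = wav.getframerate()              # e.g. 16000 samples/second
    raw_bytes = wav.readframes(wav.getnframes())  # the raw digitized audio

# Each 16-bit sample is a signed integer in [-32768, 32767]: this is the
# "stream of numbers" that the rest of the pipeline works with.
samples = struct.unpack("<%dh" % (len(raw_bytes) // 2), raw_bytes)

print(sample_rate, samples[:10])  # sample rate and the first few amplitudes
```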

In order to recognize words, we need a computer program that can break the recorded sound down into its individual phonemes, and then connect those phonemes into the most likely combinations to form words.

Step 2. Identifying phonemes

A phoneme is the smallest unit of sound that distinguishes one word from another. The word “cat”, for example, consists of three phonemes, transcribed in ARPABET as:

K AE T

What rule can we specify to allow our computer to determine whether a certain segment of a sound recording is the “AE” phoneme in “cat”? It is not an exact science. Different speakers pronounce the “AE” phoneme differently depending on their accent, tone of voice, vocal timbre, age, gender, and even emotional state.

Instead of trying to come up with a rule for what the “AE” phoneme sounds like, we can feed a Machine Learning (ML) algorithm thousands of hours of English speech, and allow it to figure out for itself what the “AE” phoneme is supposed to sound like. Then we can ask the algorithm:

Given that these sounds are all “AE”, is this sound also “AE”?

An important point to note here is that the algorithm is not trying to figure out which phonemes individual words are made up of. This process has already been completed by language experts, who have released dictionaries of word-phoneme mappings that can be used to train speech recognition engines.
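
As an illustration, the sketch below builds a tiny lookup table in the style of the CMU Pronouncing Dictionary, one of the best-known word-to-phoneme mappings (stress digits omitted here for simplicity):

```python
# A few entries in the style of the CMU Pronouncing Dictionary, which maps
# English words to ARPABET phonemes (stress digits omitted for simplicity).
RAW_ENTRIES = """\
CAT K AE T
DOLPHINS D AA L F AH N Z
ZEBRA Z IY B R AH
"""

# Build a word -> phoneme-list lookup table of the kind used to train
# a speech recognition engine.
pronunciations = {}
for line in RAW_ENTRIES.splitlines():
    word, *phonemes = line.split()
    pronunciations[word] = phonemes

print(pronunciations["CAT"])  # ['K', 'AE', 'T']
```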

What the ML algorithm is trying to do is map sounds to phonemes, and then connect those phonemes into the most likely combinations to form words.

It does this by chopping up phonetically annotated sound clips into very short (25ms) frames. Each frame is converted to a set of numbers which represent the different sound frequencies in the frame. The ML algorithm then learns to associate certain frames or combinations of frames with the corresponding parts of the phonetic transcription.
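
A simplified sketch of this framing step is shown below, using numpy. Production systems typically use overlapping, windowed frames and further processing (such as mel-scale filtering) rather than a raw magnitude spectrum, but the basic idea is the same:

```python
import numpy as np

def frame_features(samples, sample_rate=16000, frame_ms=25):
    """Chop audio into short frames and describe each frame by its
    frequency content (here, a raw magnitude spectrum)."""
    frame_len = int(sample_rate * frame_ms / 1000)  # 25 ms -> 400 samples
    n_frames = len(samples) // frame_len
    features = []
    for i in range(n_frames):
        frame = samples[i * frame_len : (i + 1) * frame_len]
        # The FFT converts the frame into a set of numbers representing
        # the strength of the different sound frequencies it contains.
        features.append(np.abs(np.fft.rfft(frame)))
    return np.array(features)  # one feature vector per 25 ms frame
```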

Every time the training program encounters the “AE” phoneme, it incorporates the new example into its Acoustic Model (AM) of the sound, gradually building up a comprehensive representation of what the “AE” phoneme should sound like.
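
As a caricature of this accumulation process, the sketch below keeps a running single-Gaussian summary of one phoneme’s feature vectors. Real acoustic models, whether Gaussian mixtures or neural networks, are far more sophisticated, and the class shown here is purely illustrative:

```python
import numpy as np

class PhonemeModel:
    """A toy acoustic model: a running single-Gaussian summary of the
    feature vectors seen for one phoneme. Purely illustrative."""

    def __init__(self, dim):
        self.count = 0
        self.total = np.zeros(dim)
        self.total_sq = np.zeros(dim)

    def incorporate(self, features):
        # Fold one more labeled example of the phoneme into the model.
        self.count += 1
        self.total += features
        self.total_sq += features ** 2

    def score(self, features):
        # Log-likelihood of a new frame under the accumulated Gaussian:
        # higher means "sounds more like this phoneme".
        mean = self.total / self.count
        var = self.total_sq / self.count - mean ** 2 + 1e-6
        return -0.5 * np.sum(np.log(2 * np.pi * var)
                             + (features - mean) ** 2 / var)
```

At recognition time, an unknown frame can be scored against every phoneme’s model, with the best-scoring phoneme winning (before the sequence-level corrections described in Step 3).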

Step 3. Connecting phonemes

Once the algorithm has processed all of the training data, we can ask it to identify an audio recording of the word “cat”. It will break the recording down and analyze it, as described above, in an attempt to identify its constituent phonemes.

However, because some phonemes (and consequently some words) have incredibly similar pronunciations, sometimes the computer’s best guess at the recording’s constituent phonemes isn’t accurate enough for reliable speech recognition. Fortunately, there is a way to improve the computer’s accuracy.

We can narrow down the possible phoneme choices by employing a statistical model called a Hidden Markov Model (HMM). An HMM estimates the likelihood of a future state (the next phoneme in the sound) given the current state (the current phoneme in the sound).

When it comes to phonemes in the English language, certain combinations are much more likely than others. For example, the “Z” in “zebra” almost never follows the “K” in “cat”, but the “AE” in “cat” very often follows “K”.
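
A toy version of such a transition table might look like the following; the probabilities are invented for illustration and are not real statistics:

```python
# Toy transition probabilities P(next phoneme | current phoneme), of the
# kind an HMM learns from data. The numbers are invented for illustration.
transitions = {
    "K": {"AE": 0.22, "IH": 0.15, "Z": 0.0},  # "Z" after "K": (almost) never
    "AE": {"T": 0.18, "N": 0.14},
    # ... one row per phoneme in a real model
}

def sequence_probability(phonemes):
    """Multiply transition probabilities along a candidate phoneme sequence."""
    prob = 1.0
    for current, nxt in zip(phonemes, phonemes[1:]):
        prob *= transitions.get(current, {}).get(nxt, 0.0)
    return prob

print(sequence_probability(["K", "AE", "T"]))  # plausible: 0.0396
print(sequence_probability(["K", "Z", "T"]))   # implausible: 0.0
```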

When a speech recognizer is attempting to map a sound to its constituent words and phonemes, it gives precedence to likely combinations of words and phonemes over unlikely or impossible ones. It knows which combinations are likely by consulting a Language Model (LM): a statistical model of word sequences, built from large collections of English text.

For example, the sentence “Dolphins swim” is much more likely to occur in English than “Doll fins swim”, even though “dolphins” and “doll fins” consist of exactly the same sequence of phonemes.
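
Here is how a toy bigram language model might separate those two readings. Again, the probabilities are invented for illustration:

```python
# Toy bigram language model: P(word | previous word). In a real system these
# probabilities are estimated from large text corpora; here they are invented.
bigram_lm = {
    ("<s>", "dolphins"): 1e-4,   # "<s>" marks the start of a sentence
    ("dolphins", "swim"): 5e-2,
    ("<s>", "doll"): 2e-5,
    ("doll", "fins"): 1e-7,
    ("fins", "swim"): 1e-4,
}

def sentence_probability(words):
    """Score a word sequence by chaining bigram probabilities."""
    prob = 1.0
    for prev, word in zip(["<s>"] + words, words):
        prob *= bigram_lm.get((prev, word), 1e-10)  # tiny floor for unseen pairs
    return prob

# Same phonemes, very different language-model scores:
print(sentence_probability(["dolphins", "swim"]))      # 5e-06
print(sentence_probability(["doll", "fins", "swim"]))  # 2e-16
```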

Step 4. Hello computer!

We now have a computer program that can analyze recorded sound and convert it into the most likely sequence of words.

But how does all of this help English learners to improve their speaking skills? Read Part 2 to find out! (Coming soon!)