How does Speech Recognition work, and how can it help us teach English? (Part 1)

Automatic Speech Recognition (ASR) seems to be everywhere these days, from your smart fridge to your smartphone and every device in between. But how does it actually work, and how can teachers of English use it?

In the first part of this blog post, we learn how speech is transformed from vibrations in the air to text on your screen. In the second part (coming soon!), we take a look at some of the ways speech recognition can be used as a teaching and testing tool in English language pedagogy.

Step 1. Analog to digital

Humans live in an analog world. When we speak to each other, we don’t transmit streams of numbers; we vibrate our vocal cords, which create sound waves that vibrate other people’s eardrums, which send electrical signals into the brain, which the brain interprets as words. Unfortunately, computers can’t process sound waves without first converting them into a digital form, i.e. a stream of numbers.

This is where a microphone comes in. A microphone changes vibrations in the air into an electrical signal, and an analog-to-digital converter (ADC), usually built into the recording device, turns that signal into a stream of numbers. However, this is all the hardware can do. It can convert an analog audio wave into a digital stream of numbers, but it has no idea what words (or other sounds) those numbers represent.
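To make “a stream of numbers” a little more concrete, here is a minimal Python sketch of the conversion step. The 16,000 samples-per-second rate, the 16-bit depth, and the 440 Hz test tone are all assumptions chosen for illustration, and `analog_signal` is a hypothetical stand-in for the voltage coming out of a real microphone.

```python
import math

SAMPLE_RATE = 16_000   # samples per second (assumed; a common rate for speech)
BIT_DEPTH = 16         # bits per sample (assumed)

def analog_signal(t: float) -> float:
    """Hypothetical stand-in for the microphone voltage at time t: a 440 Hz tone in [-1, 1]."""
    return math.sin(2 * math.pi * 440 * t)

def digitize(signal, duration_s: float) -> list[int]:
    """Measure the signal at regular intervals and round each measurement to an integer."""
    max_level = 2 ** (BIT_DEPTH - 1) - 1          # 32767 for 16-bit audio
    n_samples = round(SAMPLE_RATE * duration_s)
    return [round(signal(i / SAMPLE_RATE) * max_level) for i in range(n_samples)]

samples = digitize(analog_signal, duration_s=0.001)
print(len(samples))   # 16 numbers for one millisecond of sound
print(samples[:4])
```

Everything that follows in a speech recognizer works on lists of integers like `samples`, not on the sound itself.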

In order to recognize words, we need a computer program that can break the recorded sound down into its individual phonemes, and then connect those phonemes into the most likely combinations to form words.

Step 2. Identifying phonemes

A phoneme is the smallest significant part of a spoken word. The word “cat”, for example, consists of three phonemes, transcribed in ARPABET as:

K AE T

What rule can we specify to allow our computer to determine whether a certain segment of a sound recording is the phoneme “AE” in “cat”? It is not an exact science. Different speakers pronounce the “AE” phoneme differently depending on their accent, their tone of voice, their vocal timbre, their age, gender, and even emotional state.

Instead of trying to come up with a rule for what the “AE” phoneme sounds like, we can feed a Machine Learning (ML) algorithm thousands of hours of English speech, and allow it to figure out for itself what the “AE” phoneme is supposed to sound like. Then we can ask the algorithm:

Given that these sounds are all “AE”, is this sound also “AE”?

An important point to note here is that the algorithm is not trying to figure out which phonemes individual words are made up of. This process has already been completed by language experts, who have released dictionaries of word-phoneme mappings that can be used to train speech recognition engines.
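As a rough illustration, a word-phoneme dictionary can be pictured as a simple lookup table. The three entries below are hand-written for this sketch using ARPABET symbols (without stress markers), and `phonemes_for` is a hypothetical helper; published dictionaries contain many thousands of words.

```python
# Hand-written toy entries for illustration only.
PRONUNCIATIONS: dict[str, list[str]] = {
    "cat": ["K", "AE", "T"],
    "cab": ["K", "AE", "B"],
    "zebra": ["Z", "IY", "B", "R", "AH"],
}

def phonemes_for(sentence: str) -> list[str]:
    """Look up each word and join the phoneme sequences into one list."""
    result: list[str] = []
    for word in sentence.lower().split():
        result.extend(PRONUNCIATIONS[word])
    return result

print(phonemes_for("Cat cab"))  # ['K', 'AE', 'T', 'K', 'AE', 'B']
```

With a table like this, a transcript of a training recording can be turned into the phoneme sequence the recording is expected to contain.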

What the ML algorithm is trying to do is map sounds to phonemes, and then connect those phonemes into the most likely combinations to form words.

It does this by chopping up phonetically annotated sound clips into very short (25 ms) frames. Each frame is converted to a set of numbers which represent the different sound frequencies in the frame. The ML algorithm then learns to associate certain frames or combinations of frames with the corresponding parts of the phonetic transcription.
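The framing step can be sketched in a few lines of Python. This is a simplified illustration, not how any particular engine does it: the 8,000 samples-per-second rate is an assumption, the frames here do not overlap, and the frequency analysis is a plain (slow) discrete Fourier transform. Real systems typically use overlapping frames and more compact features derived from the spectrum.

```python
import cmath
import math

SAMPLE_RATE = 8_000                          # samples per second (assumed)
FRAME_MS = 25                                # frame length from the text
FRAME_LEN = SAMPLE_RATE * FRAME_MS // 1000   # 200 samples per frame

def split_into_frames(samples: list[float]) -> list[list[float]]:
    """Cut the recording into consecutive, non-overlapping 25 ms frames."""
    return [samples[i:i + FRAME_LEN]
            for i in range(0, len(samples) - FRAME_LEN + 1, FRAME_LEN)]

def magnitude_spectrum(frame: list[float]) -> list[float]:
    """Naive discrete Fourier transform: the strength of each frequency in the frame."""
    n = len(frame)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(frame)))
            for k in range(n // 2)]

# Test signal: 50 ms of a pure 400 Hz tone, which gives exactly two frames.
tone = [math.sin(2 * math.pi * 400 * t / SAMPLE_RATE) for t in range(2 * FRAME_LEN)]
frames = split_into_frames(tone)
spectrum = magnitude_spectrum(frames[0])

# Each position in the spectrum covers SAMPLE_RATE / FRAME_LEN = 40 Hz.
peak_bin = max(range(len(spectrum)), key=spectrum.__getitem__)
peak_hz = peak_bin * SAMPLE_RATE / FRAME_LEN
print(len(frames), peak_hz)   # 2 400.0
```

The list `spectrum` is the “set of numbers” for one frame: for a pure tone it has a single spike, while a vowel like “AE” would show several peaks at once.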

Every time the training program encounters the “AE” phoneme, it accommodates the new example in its Acoustic Model (AM) of the sound, thereby building up a comprehensive representation of what the “AE” phoneme should sound like.
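Here is a deliberately tiny sketch of that idea. It keeps a running average of the feature numbers seen for each phoneme and classifies a new frame by picking the closest average. Real acoustic models are far more sophisticated (statistical mixtures or neural networks), so treat `ToyAcousticModel` and its two-number “frames” as invented stand-ins.

```python
import math
from collections import defaultdict

class ToyAcousticModel:
    """Keeps a running average feature vector for each phoneme."""

    def __init__(self) -> None:
        self.means: dict[str, list[float]] = {}
        self.counts: dict[str, int] = defaultdict(int)

    def train(self, phoneme: str, features: list[float]) -> None:
        """Fold one labelled frame into the running average for its phoneme."""
        self.counts[phoneme] += 1
        n = self.counts[phoneme]
        mean = self.means.setdefault(phoneme, [0.0] * len(features))
        for i, value in enumerate(features):
            mean[i] += (value - mean[i]) / n

    def classify(self, features: list[float]) -> str:
        """Return the phoneme whose average is closest to the new frame."""
        return min(self.means, key=lambda p: math.dist(self.means[p], features))

model = ToyAcousticModel()
# Invented two-number "frames"; real feature vectors have many more dimensions.
model.train("AE", [0.9, 0.1])
model.train("AE", [0.7, 0.3])
model.train("IY", [0.2, 0.8])
print(model.classify([0.75, 0.2]))   # AE
```

Each call to `train` nudges the stored average toward the new example, which is the “accommodating” step described above in miniature.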

Step 3. Connecting phonemes

Once the algorithm has processed all of the training data, we can then ask it to identify an audio recording of the word “cat”. It will break the recording down and analyze it, as described above, in an attempt to identify its constituent phonemes.

However, because some phonemes (and consequently some words) have incredibly similar pronunciations, sometimes the computer’s best guess at the recording’s constituent phonemes isn’t accurate enough for reliable speech recognition. Fortunately, there is a way to improve the computer’s accuracy.

We can narrow down the possible phoneme choices by employing a statistical model called a Hidden Markov Model (HMM). An HMM uses probabilities to estimate the likelihood of the next state (the next phoneme in the sound) given the current state (the current phoneme in the sound).

When it comes to phonemes in the English language, certain combinations are much more likely than other combinations. For example, “Z” in “zebra” almost never follows the phoneme “K” in “cat”, but “AE” in “cat” often follows “K” in “cat”.
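The sketch below shows how this plays out, using the Viterbi algorithm, a standard way of finding the most probable state sequence in an HMM. Every number in it is invented for illustration, and it treats each sound as a single step, which is a big simplification of how real recognizers model phonemes. The point is only that the middle sound, which on acoustic evidence alone looks slightly more like “Z”, is corrected to “AE” once the transition probabilities are taken into account.

```python
PHONEMES = ["K", "AE", "T", "Z"]

# Invented probabilities of moving from one phoneme (row) to the next (column).
TRANSITION = {
    "K":  {"K": 0.10, "AE": 0.80, "T": 0.09, "Z": 0.01},
    "AE": {"K": 0.20, "AE": 0.10, "T": 0.60, "Z": 0.10},
    "T":  {"K": 0.30, "AE": 0.30, "T": 0.10, "Z": 0.30},
    "Z":  {"K": 0.30, "AE": 0.30, "T": 0.30, "Z": 0.10},
}
START = {"K": 0.25, "AE": 0.25, "T": 0.25, "Z": 0.25}

def viterbi(acoustic_scores: list[dict[str, float]]) -> list[str]:
    """Find the most probable phoneme sequence for a list of per-sound scores."""
    # best[p] = (probability, path) of the best sequence so far that ends in p
    best = {p: (START[p] * acoustic_scores[0][p], [p]) for p in PHONEMES}
    for scores in acoustic_scores[1:]:
        new_best = {}
        for p in PHONEMES:
            prob, path = max(
                ((best[q][0] * TRANSITION[q][p], best[q][1]) for q in PHONEMES),
                key=lambda item: item[0],
            )
            new_best[p] = (prob * scores[p], path + [p])
        best = new_best
    return max(best.values(), key=lambda item: item[0])[1]

# Invented acoustic scores for three sounds; the middle one slightly favours "Z".
scores = [
    {"K": 0.70, "AE": 0.10, "T": 0.10, "Z": 0.10},
    {"K": 0.05, "AE": 0.40, "T": 0.10, "Z": 0.45},
    {"K": 0.10, "AE": 0.10, "T": 0.70, "Z": 0.10},
]
acoustic_only = [max(s, key=s.get) for s in scores]
print(acoustic_only)      # ['K', 'Z', 'T']
print(viterbi(scores))    # ['K', 'AE', 'T']
```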

When a speech recognizer is attempting to map a sound to its constituent words and phonemes, it will give precedence to likely combinations of words and phonemes over unlikely or impossible combinations. It knows which word combinations are likely by referring to a statistical model of word sequences, typically built from a very large collection of text, known as the Language Model (LM).

For example, the sentence “Dolphins swim” is much more likely to occur in the English language than “Doll fins swim”, even though “dolphins” and “doll fins” sound almost identical.
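One simple way to capture which word sequences are likely is to count how often pairs of adjacent words (bigrams) occur in a collection of text. The sketch below does this with a four-sentence corpus invented for the example; the add-one smoothing, which stops unseen pairs from scoring zero, is just one of several possible choices.

```python
import math
from collections import Counter

# A tiny invented corpus standing in for the large body of text a real model uses.
CORPUS = [
    "dolphins swim in the sea",
    "dolphins swim fast",
    "fish have fins",
    "the doll has a dress",
]

contexts: Counter = Counter()   # how often each word is followed by another word
bigrams: Counter = Counter()    # how often each pair of adjacent words occurs
vocab: set[str] = set()
for sentence in CORPUS:
    words = ["<s>"] + sentence.split()   # "<s>" marks the start of a sentence
    vocab.update(words[1:])
    contexts.update(words[:-1])
    bigrams.update(zip(words, words[1:]))

def log_probability(sentence: str) -> float:
    """Sum of log bigram probabilities, with add-one smoothing for unseen pairs."""
    words = ["<s>"] + sentence.lower().split()
    return sum(
        math.log((bigrams[(prev, word)] + 1) / (contexts[prev] + len(vocab)))
        for prev, word in zip(words, words[1:])
    )

candidates = ["dolphins swim", "doll fins swim"]
print(max(candidates, key=log_probability))   # dolphins swim
```

Because “dolphins swim” appears in the corpus and “doll fins” never does, the first candidate scores higher, so a recognizer using these counts would prefer it when the audio alone cannot decide.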

Step 4. Hello computer!

We now have a computer program that can analyze recorded sound and convert it into the most likely sequence of words.

But how does all of this help English learners to improve their speaking skills? Read Part 2 to find out! (Coming soon!)

20 Tech Tips from JALT CALL 2019

The 2019 JALT CALL conference was informative and enjoyable as usual! Here are some handy highlights and tech tips I picked up during the three days of presentations…

  1. The big names that come up every year include EnglishCentral, WordEngine, Pocket Passport, and XReading. Check them out if you don’t already know them!
  2. Did you know you can use MoodleCloud to host your Moodle installation?
  3. According to EnglishCentral, “difficulties”, “colony”, and “discovered” are amongst the words Japanese learners of English find the most difficult to pronounce
  4. Kyoto University is using blockchain to power its learning analytics
  5. Kai-Fu Lee discusses AI in his best-selling book “AI Superpowers”
  6. Musio X robot helps Japanese kids learn English
  7. Google Duplex can call local businesses to arrange appointments
  8. Pocket Talk puts the power of two-way voice translation in your pocket
  9. Translatotron can translate L1 speech directly into L2 speech without the need for an intermediary text transcription stage
  10. Critical thinking, people management, and creativity will be among the top 10 job skills in 2020 according to the World Economic Forum
  11. DialogFlow can be used to create natural AI-powered “conversation experiences”
  12. Seesaw empowers students to demonstrate and share learning
  13. Google Classroom is gaining traction in Japan, although I experienced issues inviting students from certain institutions that hadn’t yet granted access to the tool
  14. Did you know that Google complies with the EU’s General Data Protection Regulation (GDPR)?
  15. Did you know that there are 118 million smart speakers in US households?
  16. Alexa Skill Blueprints allow you to easily create your own Alexa Skill
  17. contains lots of useful text analysis tools
  18. Learner English corpora include ICLE, JEFFL, and many others
  19. There are also many native speaker corpora
  20. Manaba is a popular LMS in Japan

Also, don’t forget to check out my own sites:

… and buy my book if you’re interested in learning more about how to use tech in the ESL classroom!

My appearance on

I was delighted to appear on the excellent and informative podcast with James last weekend. The episode has just been released, and I talk about Computer Assisted Language Learning, writing graded readers, and teaching at universities in Japan.

Here is a quick rundown of my sites mentioned on the podcast:

Also, don’t forget to check out my book of tech tips for English teachers, and support me on Patreon if you find my work useful.

25 Tech Tips from JALT 2018


  1. Did you know that Google Slides now offers a feature to display automatic closed captions (i.e. it types what you say as you speak)?
  2. John Blake offers a variety of web-based tools for language teachers and learners
  3. Regex101 helps you write regular expressions
  4. Google Sites can be used to store handouts, host video and audio, link to useful websites, and a range of other useful functions
  5. Did you know that it’s possible to slow down and speed up YouTube videos?
  6. Google has a vocabulary learning activity built into its mobile search portal
  7. Ozdic is a useful collocation dictionary
  8. YouGlish allows students to use YouTube to improve their English pronunciation
  9. The Google Public Data Explorer makes large datasets easy to explore, visualize and communicate
  10. Gapminder produces free teaching resources making the world understandable based on reliable statistics
  11. ToPhonetics allows you to convert English text to IPA phonetic transcription
  12. Clip2Comic (iOS app) is a useful app for converting photos to comics for storytelling and other educational purposes
  13. CNN now produces reading and listening lessons for English learners, including text-to-speech audio and a vocabulary look-up feature…
  14. … while DreamReader provides free English reading practice for learners
  15. PowToon allows you to create engaging animated videos with a library of styles, characters, and backgrounds
  16. Unsplash provides free stock photography for any purpose…
  17. …as does Pexels
  18. Cambridge World of Better Learning provides insights, tips and tools for language teachers
  19. Pocket Passport provides flashcards, storyboards, digital quizzes, and other resources for English language teachers
  20. EnglishCentral offers a series of high frequency vocabulary lists to help identify gaps in students’ knowledge, in addition to an online vocabulary level check
  21. Just-in-time learning involves using technology to consume learning materials at any time and in any place
  22. Did you know that MEXT officially promotes the use of ICT for active learning and for increasing the amount of time spent engaging with foreign languages?
  23. Scott Sustenance has developed an innovative system based on “mnemotechnics” (a.k.a. the “keyword method”) for enhancing students’ vocabulary recall ability. Check out his students’ work on his Instagram feed: #kwvocab18
  24. Nearpod provides a variety of real-time activities suitable for language classrooms, including open ended questions, fill-in-the-blanks, matching activities, and more
  25. The Font is an online journal of quality writing on the theme of teaching and learning languages at home and abroad

If you found these tips useful, why not check out the new version of my book, which has been revised, updated and expanded for 2019: 50 Ways to Teach with Technology

Inspirational speakers of English (according to Japanese college students)

I asked my Keio Study Skills students to produce a list of what they considered to be “inspirational public speakers”. The speakers on the list had to be able to speak English, but not necessarily as a first language. The list had to include both male and female speakers. This is the list they came up with:

  • Donald Trump (US President)
  • Malala Yousafzai (Activist and Nobel Peace Prize laureate)
  • Hillary Clinton (US politician)
  • Michelle Obama (US politician)
  • Martin Luther King Jr. (US civil rights activist)
  • Mark Zuckerberg (Founder of Facebook)
  • George Bush (ex-US President)
  • Charlie Chaplin (Actor and comedian)
  • Christel Takigawa (Japanese television announcer, Tokyo Olympics spokesperson)
  • Bill Gates (Founder of Microsoft)
  • Aung San Suu Kyi (Politician, diplomat, author, and Nobel Peace Prize laureate)
  • J. F. Kennedy (ex-US President)
  • Emma González (Survivor of Stoneman Douglas High School shooting)
  • Hiroshi Mikitani (CEO of Rakuten)
  • Margaret Thatcher (ex-UK Prime Minister)
  • Winston Churchill (ex-UK Prime Minister)

I pointed out that George Bush was famous for his English grammatical mistakes, and that Donald Trump, while inspirational to many, is probably not the best role model for public speaking. I also couldn’t find a video of Christel Takigawa speaking English (she mainly speaks French and Japanese) so replaced her with a clip of Masato Mizuno, who also spoke in favor of the Tokyo 2020 Olympics.

Other than these unsuitable choices, the list is not a bad selection of different speakers and speaking styles, and it represents both male and female, as well as native and non-native speakers of English.

Of course, the reason many of the speakers are on this list is not necessarily because they are inspirational speakers per se, but rather because they have had inspirational experiences, or are conveying an inspirational message, or have achieved a high level of business or political success. The availability heuristic also had an obvious impact on the selection of speakers, especially considering the high number of US presidents and UK prime ministers.

I made a short compilation video featuring some of the speakers from the list to help inspire students for their own presentations: