The War Against Machine Translation

LINE’s machine translation function can easily be confused for idle chat, but in fact it is potentially much more harmful

The problem of machine translation

As language teachers, it seems that every day we have to battle the pernicious force of machine translation (MT). In 1997, Alta Vista launched Babelfish, one of the first web-based interfaces for MT. Twenty years later, it seems like every web portal, social network, and search engine offers some kind of automatic translation tool. Even LINE, the kawaii messaging service ubiquitous in Japan, offers an instant translation function, which behaves just like regular chat.

But despite its apparent popularity, and arguable usefulness as an assistive tool to human translators, MT is not a helpful technology for language teachers or learners. It is at best a nuisance, and at worst strongly detrimental to students’ second language acquisition.

The main problems with MT with regard to language pedagogy are that:

  1. It is inaccurate, especially for idiomatic expressions; and
  2. It negates students’ opportunities for language learning

The first of these problems is easy to observe: type any reasonably idiomatic expression into Google Translate, perhaps the best free web-based MT tool available right now. Unfortunately, as we shall see, that’s not saying very much.

Exhibit 1

[Screenshot: Google Translate result]

In this example we see the translator mess up the word order, and also render the verb “drink” as the noun “drink”. “I went to drink a beer with friends” is the more natural human-produced translation for this sentence.

Exhibit 2

[Screenshot: Google Translate result]

In this example, again, the word order is completely jumbled, and the singular “best friend” doesn’t make sense when the question requires a plural response. Once again, the human-generated translation is far superior: “How many close friends do you have?”.

I won’t labor the point here, but you can do your own experiments with any of the currently available MT tools, and you will inevitably come to the same conclusion: MT is still quite bad. Although it can usually convey the gist of the input sentence, it clearly lacks eloquence, idiomaticity and accuracy.

What to do about it

Having concluded that MT is not a good pedagogical tool, the question arises as to how we can eliminate its use both inside and outside the language classroom.

Banning smartphones/laptops seems like overkill, especially considering the more positive technological affordances they offer

Ban smartphones in the classroom?

Within the classroom, you could prevent the influence of MT by banning smartphones entirely. But if you do this, you are indiscriminately blocking off more fruitful avenues to autonomous learning, along with many other positive affordances offered by mobile devices.

Automatic MT detection

Outside the classroom, your power over students is limited, especially over those more inclined to take the “easy” option of MT in the first place. In addition, although we may strongly suspect a student of using MT outside class, it is often difficult to prove. Although progress is being made in developing MT detection tools, this is still a nascent technology. Most of the solutions available at the moment require both the source text and the translated text in order to attempt to detect MT.

Manual MT detection

It can be possible, however, to manually detect and prove machine translation if you have a working knowledge of your students’ L1.

In a recent low-level speaking class, I asked students to record and transcribe their answers to a 1-minute speaking task. One student’s answer seemed suspiciously like “translationese”. One sentence in particular stood out: “Mother of rice is very delicious”. I guessed that the student had tried to translate the Japanese sentence “お母さんのご飯はとても美味しい” which would be more naturally rendered as “My mother’s rice is very tasty” or more idiomatically as “My mother makes very good rice”.

After inputting my hypothesis into Google Translate, I was presented with the exact same broken English as the student had used in his report. He was well and truly “busted”!

Sometimes it is possible to recreate the exact same bad machine translation through guesswork and a knowledge of your students’ L1

Eliminate coursework

Of course, detecting and subsequently proving the use of MT for a pile of 20 or 30 written reports is a huge waste of time. However, because the temptation to use MT is so high, especially for low-level, low-motivation students, simply instructing students not to do so can be ineffective.

The use of MT became so prevalent in one of my lower-level writing classes that I decided to eliminate coursework altogether, and administer every written assessment under exam conditions. This was the only way I found to guarantee that students were not using MT in their written assignments.

Highlight the inadequacy of MT

An alternative solution for more highly motivated classes (those that actually care about developing their English accuracy and idiomaticity) is to highlight how bad MT can be, and in the process hopefully dissuade them from using it altogether. One way to do this is to input some English phrases into an MT tool, and translate them into your students’ L1. Students will then understand in a more direct way how bad some of the translations can be.

Translating from English to your students’ L1 with MT can be a useful consciousness raising activity. The Japanese translation on the right is very unnatural.

Conclusion

One day, machine translation may be accurate enough to make language teachers redundant, along with translators, interpreters, subtitlers, and a host of other language-related professions. It may cause an industry shake-up as far-reaching as self-driving cars. But that day is unlikely to be any time in the near future, despite how far we’ve come in recent years. The current generation of MT tools often produces inaccurate and unidiomatic translations. MT is unhelpful for English language pedagogy, and steps should be taken to detect and prevent students’ use of MT.

30 Links for English Language Data Geeks

A typical corpus linguist. Although I personally prefer blue braces.
  1. The Moby Lexicon Project
  2. BNC Baby
  3. Full BNC
  4. Project Gutenberg (Download full database)
  5. CMU Pronouncing Dictionary
  6. GNU Collaborative International Dictionary of English
  7. The Internet Dictionary Project
  8. English Wiktionary Dump
  9. Simple English Wiktionary Dump
  10. JACET 8000
  11. Minimal pairs in English RP
  12. List of homographs
  13. Homophones in English RP
  14. Google’s Official List of Bad Words
  15. Yasumasa Someya’s Lemmas List
  16. MRC Psycholinguistic Database
  17. Million Song Dataset
  18. Penn Treebank P.O.S. Tags
  19. Princeton University’s WordNet
  20. The Sentence Corpus of Remedial English
  21. Summer Institute of Linguistics (SIL) Word List
  22. The Tanaka Corpus
  23. The General Service List
  24. The New General Service List
  25. The Academic Word List
  26. The New Academic Word List
  27. The TOEIC Word List
  28. The Business Service List
  29. Apache Open Office MyThes
  30. Global WordNet

Generating over 2000 flashcards from a DIY corpus of TOEFL material


Download the CSV of all 2,313 terms (inc. Japanese definitions) or access the full list on Quizlet.

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.


Step 1: Assemble a corpus of TOEFL past papers

For my corpus, I used material from both the older CBT (Computer Based Test) and the current iBT (Internet Based Test). I found most of the materials online for free. Some were already in plain text format, but most were PDFs and required Optical Character Recognition (OCR) to convert to plain text. I used ABBYY’s FineReader Pro for Mac, but there are plenty of other options out there too. Some files were in Microsoft Word format (.doc/.docx), and MacOS X’s batch conversion utility came in handy for these. I included model answers, listening transcripts, reading passages and multiple choice questions (prompts, distractors and answers). I tried to exclude explanations, advice and instructions from the authors and/or publishers.

Ultimately, I ended up with a corpus just shy of a million words (959,124 to be precise). In general, bigger is better when it comes to corpus research. The TOEIC Service List (TSL) utilizes a corpus of about 1.5 million words, so my TOEFL corpus seems roughly comparable to this.

Step 2: Count the number of occurrences of each word

I used some custom PHP code to process my corpus data (although Python is probably better suited to corpus analysis). I lemmatized each token where possible using Yasumasa Someya’s list of lemmas. I then cross-referenced each lemma occurrence with the NGSL, NAWL and TSL. Finally, I exported to a CSV, and ended up with 13,287 rows of data.
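The PHP code itself is not reproduced here, but the counting step can be sketched in Python. This is only a rough illustration under stated assumptions: `LEMMAS` and `WORD_LISTS` are tiny hypothetical stand-ins for the real lemma list and word lists, and the regex tokenizer is a simplification rather than the method actually used.

```python
import csv
import io
import re
from collections import Counter

# Hypothetical stand-ins: the real lemma list and word lists are far larger.
LEMMAS = {"went": "go", "goes": "go", "friends": "friend", "studies": "study"}
WORD_LISTS = {"NGSL": {"go", "friend"}, "NAWL": {"study"}, "TSL": set()}


def tokenize(text):
    """Lowercase the text and pull out alphabetic tokens (straight apostrophes kept)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def count_lemmas(text, lemmas=LEMMAS):
    """Map each token to its lemma where one is known, then count occurrences."""
    return Counter(lemmas.get(token, token) for token in tokenize(text))


def to_csv(counts, word_lists=WORD_LISTS):
    """Build CSV text: one row per lemma with its count and a flag per word list."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    names = sorted(word_lists)
    writer.writerow(["lemma", "count"] + names)
    for lemma, n in counts.most_common():
        flags = ["yes" if lemma in word_lists[name] else "no" for name in names]
        writer.writerow([lemma, n] + flags)
    return buffer.getvalue()


counts = count_lemmas("She goes to study. He went with friends and a friend studies.")
print(to_csv(counts))
```

In a real run, the text would come from the corpus files and the CSV would be written to disk rather than printed.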

Step 3: Curate the final list

For my final list I removed any words which also appear on the NGSL, any contractions (e.g. “Don’t”, “I’m”, “that’s”), any numbers written in word form (e.g. “two”, “million”), any vocalizations (e.g. “uh”, “oh”), any ordinals (e.g. “first”, “second”, “third”), any proper nouns (e.g. “James”, “Elizabeth”, “America”, “San Francisco”, “New York”), and any words with fewer than 5 occurrences in the corpus. Next, I ran the list through a spell checker, and excluded any unrecognized words. I also excluded any non-lexical words, to leave a list consisting only of nouns, verbs, adjectives and adverbs.
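The curation rules above amount to a chain of filters. Here is a minimal Python sketch of that idea, assuming each row already carries a part-of-speech tag, a proper-noun flag and a spell-check result. The `Entry` fields and the small sets are hypothetical placeholders, not the actual data.

```python
from typing import NamedTuple

# Hypothetical stand-ins: the real NGSL and exclusion lists are much larger.
NGSL = {"people", "time"}
NUMBER_WORDS = {"two", "million", "first", "second", "third"}
VOCALIZATIONS = {"uh", "oh"}
LEXICAL_POS = {"noun", "verb", "adjective", "adverb"}
MIN_COUNT = 5  # words with fewer than 5 occurrences are dropped


class Entry(NamedTuple):
    word: str
    count: int
    pos: str
    proper_noun: bool = False  # assumed to be flagged beforehand
    spelled_ok: bool = True    # assumed result of a spell-checker pass


def keep(entry):
    """Return True only if the entry survives every exclusion rule."""
    word = entry.word.lower()
    if word in NGSL or word in NUMBER_WORDS or word in VOCALIZATIONS:
        return False
    if "'" in word or "’" in word:  # contractions
        return False
    if entry.proper_noun or not entry.spelled_ok:
        return False
    if entry.count < MIN_COUNT:
        return False
    return entry.pos in LEXICAL_POS


rows = [
    Entry("sediment", 12, "noun"),
    Entry("people", 900, "noun"),
    Entry("don't", 50, "verb"),
    Entry("uh", 30, "interjection"),
    Entry("James", 40, "noun", proper_noun=True),
    Entry("photosynthesis", 4, "noun"),
    Entry("whereas", 20, "conjunction"),
    Entry("migrate", 5, "verb"),
]
curated = [entry.word for entry in rows if keep(entry)]
print(curated)
```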

Step 4: Generate flashcards

I now had a list of 2313 terms, made up of 523 adjectives, 123 adverbs, 1366 nouns, and 301 verbs. I used Text to Flash to generate Japanese definitions for each word, then uploaded the words to Quizlet, separated into part-of-speech and ordered alphabetically.
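As a rough illustration of the grouping described above, the following Python sketch separates terms by part of speech and sorts each group alphabetically. The sample terms are made up for the example. The reported per-category counts are taken from the paragraph above and checked against the total.

```python
from collections import defaultdict

# Counts reported in the post; they should add up to the 2313 terms.
REPORTED = {"adjective": 523, "adverb": 123, "noun": 1366, "verb": 301}
total_terms = sum(REPORTED.values())


def group_by_pos(terms):
    """Group (word, pos) pairs by part of speech, each group sorted alphabetically."""
    groups = defaultdict(list)
    for word, pos in terms:
        groups[pos].append(word)
    return {pos: sorted(words, key=str.lower) for pos, words in sorted(groups.items())}


# Hypothetical sample terms, for illustration only.
sample = [
    ("sediment", "noun"),
    ("migrate", "verb"),
    ("arid", "adjective"),
    ("habitat", "noun"),
    ("rapidly", "adverb"),
    ("erode", "verb"),
]
grouped = group_by_pos(sample)
for pos, words in grouped.items():
    print(pos, words)
```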

Multilingual, part-of-speech categorized, difficulty sorted Quizlet flashcards for NGSL, NAWL and TSL

Feb. 2017 update

Unfortunately, after uploading all the flashcard sets to Quizlet, my account started to run so slowly that it became unusable. I had to remove the majority of the data from Quizlet, but I am now offering the data to download in CSV format. Users can upload the flashcards to their own Quizlet accounts if required by using the import function.

Links to the CSVs are as follows:

Each download (.zip) includes translations for: Arabic, Chinese, Dutch, English, French, German, Greek, Italian, Japanese, Korean, Portuguese, Russian, Spanish, Swedish, Thai, and Turkish. Translations were automatically generated from public domain dictionary sources.


I’ve generated multilingual, part-of-speech categorized, difficulty sorted sets of flashcards for the latest New General Service List (NGSL), New Academic Word List (NAWL) and TOEIC Service List (TSL), and added them to Quizlet.

The sets are organized in classes according to the definition language. Each class contains sets of flashcards for the four lexical parts of speech (adverbs, verbs, adjectives and nouns). There are a maximum of 20 flashcards in each set, and the sets are ordered by difficulty (i.e. frequency), with Part 1 of each list containing the easiest (most common) words.
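The splitting scheme described above can be sketched in a few lines of Python. This is an illustration only: it assumes each word comes with a frequency figure, treats higher frequency as easier (as the post does), and uses a hypothetical `split_into_sets` helper.

```python
def split_into_sets(words_with_freq, set_size=20):
    """Order words from most to least frequent, then cut into sets of at most set_size."""
    ranked = sorted(words_with_freq, key=lambda pair: pair[1], reverse=True)
    words = [word for word, _ in ranked]
    return [words[i:i + set_size] for i in range(0, len(words), set_size)]


# Hypothetical data: 45 dummy nouns with made-up frequencies.
nouns = [(f"noun{i}", 1000 - i) for i in range(45)]
sets = split_into_sets(nouns)
for part, cards in enumerate(sets, start=1):
    print(f"Nouns Part {part}: {len(cards)} cards, starting with {cards[0]}")
```

Part 1 then holds the most common (easiest) words, and the final part holds whatever is left over once the earlier sets are full.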

As no information was given about part-of-speech in the word lists themselves, I tagged the words using Moby, and selected only the most common part-of-speech for words which can be used as multiple parts-of-speech. The word “register”, for example, is listed as a noun by Moby before it is listed as a verb, so only the noun definition of “register” was included in the flashcards.
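The "first listed part of speech wins" rule can be sketched as follows in Python. The `POS_LISTINGS` dictionary is a hypothetical, already-parsed excerpt. Only the noun-before-verb order for "register" comes from the post; the other entry is illustrative, and the actual Moby file format is not reproduced here.

```python
# Hypothetical, already-parsed excerpt: word -> parts of speech in listed order.
POS_LISTINGS = {
    "register": ["noun", "verb"],  # order as described in the post
    "quickly": ["adverb"],         # illustrative entry
}


def primary_pos(word, listings=POS_LISTINGS):
    """Return the first part of speech listed for the word, or None if it is unlisted."""
    tags = listings.get(word.lower())
    return tags[0] if tags else None


for word in ["register", "quickly", "zzyzx"]:
    print(word, primary_pos(word))
```

A consequence of this design is that secondary senses are dropped, so the verb "register" never appears on a flashcard.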

10 years in Japan

Today I mark 10 years living and working in Japan. To commemorate the occasion, here is one of my first blog posts from October 2006:


Some things about Japan that I’ve noticed:

  • The plugs don’t have switches, so if you want to turn something off, you have to physically unplug it
  • Semi-automatic doors: they lack motion sensors and only open when you press the button
  • Pelican crossings have no buttons to press
  • When it rains, everyone uses an umbrella
  • There are little racks in which to put your wet umbrella when entering shops
  • The Japanese are incredibly polite: one night some of us got lost, and when we asked for directions, we were escorted by a stranger for a good half-mile to the train station, which was the opposite direction to which he had been walking
  • The local gaijin pub, Mattari, serves fish and chips
  • The Japanese like queuing even more than the British. You might even expect to find them queuing on the platform for trains
  • There are lots of bikes
  • Pachinko parlors: buy yourself a tub full of ball bearings and pour them into an inverted pinball machine. Adopt an expression of post-lobotomy desolation. These places are completely insane.

For a more comprehensive run down of the past decade, check out my post on TEFL Journey.

20 Tech Tips from Vocab@Tokyo 2016

  1. Tom Cobb’s venerable Lex Tutor now has a mobile interface
  2. Collins and Merriam-Webster both provide free online dictionaries
  3. The University of Texas at Austin provides a wide selection of free handouts (PDF) for teachers of English language writing
  4. Calibre is a comprehensive e-book manager and converter
  5. OmniPage and ABBYY FineReader are powerful OCR (Optical Character Recognition) applications
  6. The Lexical Research Foundation is “a not-for-profit organisation to promote excellence in lexical and vocabulary acquisition, description and pedagogy.”
  7. AntWordProfiler, Web VocabProfile, Range, and P_Lex (PDF) are tools for profiling lexical sophistication of a text, i.e. the proportion of advanced (rare) vocabulary…
  8. …while TextInspector can be used to measure lexical variation, i.e. the proportion of word types to tokens
  9. Michael Covington has developed a number of algorithms and tools for analyzing texts, including Moving Average Type-Token Ratio (MATTR)
  10. Paul Nation’s book, What You Need to Know to Learn a Foreign Language, is available as a free PDF download…
  11. …as are all his Vocabulary Size Tests (VST)…
  12. …which can also be taken online via Tom Cobb’s site
  13. Laurence Anthony’s WebSCoRE is “a free, parallel concordancer with a specially developed bilingual pedagogical corpus”
  14. Paul Meara’s Lognostics website “is designed to provide access to up to date research tools for people working in the field of Second Language Vocabulary Acquisition”
  15. Vocabulary Learning and Instruction (VLI) is an open access international journal for research relating to vocabulary acquisition, instruction, and assessment.
  16. Showbie is a great tool for keeping digital portfolios of students’ work
  17. Coh-Metrix is a system for computing computational cohesion and coherence metrics for written and spoken texts
  18. Lexile Analyzer can be used to compute the complexity of a text, including sentence length and word frequency
  19. Cambridge University Press’s English Vocabulary Profile (EVP) “offers reliable information about which words and phrases are known and used by learners at each level of the Common European Framework (CEF)”
  20. The CEFR-J website provides a series of “can-do” descriptors specifically for English language teaching contexts in Japan.
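For readers unfamiliar with the measures mentioned in items 8 and 9, here is a minimal Python sketch of the type-token ratio (TTR) and the moving-average type-token ratio (MATTR). It is an illustration of the general idea, not Covington's own implementation, and the default window size of 50 is an arbitrary choice for this example.

```python
def type_token_ratio(tokens):
    """Proportion of distinct word types to total tokens."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def mattr(tokens, window=50):
    """Average the TTR over every window of `window` consecutive tokens."""
    if len(tokens) < window:
        # Text shorter than one window: fall back to the plain TTR.
        return type_token_ratio(tokens)
    ratios = [
        type_token_ratio(tokens[i:i + window])
        for i in range(len(tokens) - window + 1)
    ]
    return sum(ratios) / len(ratios)


tokens = "a b a b a c".split()
print(type_token_ratio(tokens))
print(mattr(tokens, window=3))
```

Plain TTR tends to fall as a text gets longer, because common words keep repeating. Averaging over fixed-size windows makes texts of different lengths easier to compare.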

30 Tech Tips from JALT CALL 2016


  1. James Rogers gives pronunciation advice for Japanese learners of English
  2. Linode is a powerful and good value web host
  3. The Multiplayer Classroom (Lee Sheldon) was one of the first publications arguing for gamification of education
  4. Class Craft helps you to make learning an adventure
  5. Socrative allows you to administer assessments and surveys via mobile phones
  6. Kahoot provides gamified classroom activities
  7. QuizUp offers a competitive multi-player gaming experience
  8. Sendtodropbox is a great way of getting files from your students into your Dropbox account…
  9. …while QuickVoice (iOS) allows you to record and send audio files as email attachments up to 5MB in size…
  10. …and MailVU are specialists in sending video via email
  11. Moxtra is a mobile-first embeddable collaboration platform…
  12. …and VoiceThread allows students to submit audio as attachments to images
  13. Schoology is a modern Learning Management System
  14. Ginger offers a variety of apps for online translation and grammar checking…
  15. …while Grammarly claims to make you a better writer by finding and correcting 10 times more mistakes than your word processor
  16. Wikitude is the world’s leading augmented reality SDK
  17. Diigo allows you to annotate and save web pages as you browse them
  18. Tiki Toki is web based software for creating beautiful timelines
  19. iBuildApp allows you to easily make apps for iOS or Android
  20. Mobyx (iOS) provides high quality VOIP (Voice over IP) services
  21. KanjiTomo is a comprehensive OCR (Optical Character Recognition) application for Japanese characters…
  22. …while Yomiwa (iOS) provides a real-time offline camera translator for Japanese…
  23. …and Perfect Master Kanji (iOS) is a fully fledged kanji practice app for people learning Japanese as a foreign language…
  24. …and Nihongo Shark provides free daily lessons for learners of Japanese
  25. Discord provides all-in-one text and voice chat for gamers
  26. Continuous Partial Attention (Linda Stone) is “motivated by a desire to be a LIVE node on the network”
  27. Wiggle allows you to easily import sporting goods and accessories (a weird one, but a good one for those of us who struggle to find bicycles big enough in Japan!)
  28. Phonologics offers automated pronunciation testing
  29. Words and Monsters taps into the addictive game play of apps like Puzzle and Dragons and Candy Crush by offering uncertain and unexpected rewards
  30. Paul Howard Jones is the preeminent expert on the effect of games on the brain