The War Against Machine Translation

LINE’s machine translation function can easily be mistaken for idle chat, but in fact it is potentially much more harmful

The problem of machine translation

As language teachers, it seems that every day we have to battle the pernicious force of machine translation (MT). In 1997, AltaVista launched Babel Fish, one of the first web-based interfaces for MT. Twenty years later, it seems like every web portal, social network, and search engine offers some kind of automatic translation tool. Even LINE, the kawaii messaging service ubiquitous in Japan, offers an instant translation function, which behaves just like regular chat.

But despite its apparent popularity, and arguable usefulness as an assistive tool to human translators, MT is not a helpful technology for language teachers or learners. It is at best a nuisance, and at worst strongly detrimental to students’ second language acquisition.

The main problems with MT with regard to language pedagogy are that:

  1. It is inaccurate, especially for idiomatic expressions; and
  2. It negates students’ opportunities for language learning

The first of these problems can be easily observed by typing any reasonably idiomatic expression into Google Translate, perhaps the best free web-based MT tool available right now. Unfortunately, as we shall see, that’s not saying very much.

Exhibit 1

[Screenshot: Google Translate output]

In this example we see the translator mess up the word order, and also render the verb “drink” as the noun “drink”. “I went to drink a beer with friends” is the more natural human-produced translation for this sentence.

Exhibit 2

[Screenshot: Google Translate output]

In this example, again, the word order is completely jumbled, and the singular “best friend” doesn’t make sense when the question requires a plural response. Once again, the human-generated translation is far superior: “How many close friends do you have?”

I won’t labor the point here, but you can do your own experiments with any of the currently available MT tools, and you will inevitably come to the same conclusion: MT is still quite bad. Although it can usually convey the gist of the input sentence, it clearly lacks eloquence, idiomaticity and accuracy.

What to do about it

Having concluded that MT is not a good pedagogical tool, the question arises as to how we can eliminate its use both inside and outside the language classroom.

Banning smartphones/laptops seems like overkill, especially considering the more positive technological affordances they offer

Ban smartphones in the classroom?

Within the classroom, you could prevent the influence of MT by banning smartphones entirely. But if you do this, you are indiscriminately blocking off more fruitful avenues to autonomous learning, along with many other positive affordances offered by mobile devices.

Automatic MT detection

Outside the classroom, your power over students is limited, especially over those more inclined to take the “easy” option of MT in the first place. In addition, although we may strongly suspect a student of using MT outside class, it is often difficult to prove. Progress is being made in developing MT detection tools, but the technology is still nascent. Most of the solutions available at the moment require both the source text and the translated text in order to attempt to detect MT.

Manual MT detection

It is sometimes possible, however, to detect and prove machine translation manually if you have a working knowledge of your students’ L1.

In a recent low-level speaking class, I asked students to record and transcribe their answers to a 1-minute speaking task. One student’s answer seemed suspiciously like “translationese”. One sentence in particular stood out: “Mother of rice is very delicious”. I guessed that the student had tried to translate the Japanese sentence “お母さんのご飯はとても美味しい” which would be more naturally rendered as “My mother’s rice is very tasty” or more idiomatically as “My mother makes very good rice”.

After inputting my hypothesis into Google Translate, I was presented with the exact same broken English as the student had used in his report. He was well and truly “busted”!

Sometimes it is possible to recreate the exact same bad machine translation through guesswork and a knowledge of your students’ L1
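
The comparison described above can also be done programmatically once you have pasted the MT output for your guessed L1 sentence next to the student’s text. The following is only a minimal Python sketch, not the procedure used in the class: the function names `normalize` and `similarity` are hypothetical, the word-level comparison via the standard library’s `difflib` is one assumed choice among many, and no translation service is called (you supply the MT output yourself).

```python
import difflib
import string


def normalize(text: str) -> str:
    """Lowercase, strip ASCII punctuation, and collapse whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return " ".join(text.lower().translate(table).split())


def similarity(student_text: str, mt_output: str) -> float:
    """Word-level similarity between two sentences, from 0.0 to 1.0."""
    student_words = normalize(student_text).split()
    mt_words = normalize(mt_output).split()
    return difflib.SequenceMatcher(None, student_words, mt_words).ratio()


# Sentences taken from the example in this post.
student = "Mother of rice is very delicious."
mt_result = "Mother of rice is very delicious"   # pasted from the MT tool
human = "My mother's rice is very tasty"         # natural human translation

print(similarity(student, mt_result))  # identical after normalization
print(similarity(student, human))      # noticeably lower
```

A score at or near 1.0 against the MT output is suggestive rather than conclusive, since it depends on having guessed the student’s original L1 sentence correctly.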

Eliminate coursework

Of course, detecting and subsequently proving the use of MT for a pile of 20 or 30 written reports is a huge waste of time. However, because the temptation to use MT is so high, especially for low-level, low-motivation students, simply instructing students not to do so can be ineffective.

The use of MT became so prevalent in one of my lower-level writing classes that I decided to eliminate coursework altogether and administer every written assessment under exam conditions. This was the only way I found to guarantee that students were not using MT in their written assignments.

Highlight the inadequacy of MT

An alternative solution for more highly motivated classes (those that actually care about developing their English accuracy and idiomaticity) is to highlight how bad MT can be, and in the process hopefully dissuade them from using it altogether. One way to do this is to input some English phrases into an MT tool, and translate them into your students’ L1. Students will then understand in a more direct way how bad some of the translations can be.

Translating from English to your students’ L1 with MT can be a useful consciousness raising activity. The Japanese translation on the right is very unnatural.

Conclusion

One day, machine translation may be accurate enough to make language teachers redundant, along with translators, interpreters, subtitlers, and a host of other language-related professions. It may cause an industry shake-up as far-reaching as self-driving cars. But that day is unlikely to come any time in the near future, despite how far we’ve come in recent years. The current generation of MT tools often produces inaccurate and unidiomatic translations. MT is unhelpful for English language pedagogy, and steps should be taken to detect and prevent students’ use of MT.

30 Links for English Language Data Geeks

A typical corpus linguist, although I personally prefer blue braces.
  1. The Moby Lexicon Project
  2. BNC Baby
  3. Full BNC
  4. Project Gutenberg (Download full database)
  5. CMU Pronouncing Dictionary
  6. GNU Collaborative International Dictionary of English
  7. The Internet Dictionary Project
  8. English Wiktionary Dump
  9. Simple English Wiktionary Dump
  10. JACET 8000
  11. Minimal pairs in English RP
  12. List of homographs
  13. Homophones in English RP
  14. Google’s Official List of Bad Words
  15. Yasumasa Someya’s Lemmas List
  16. MRC Psycholinguistic Database
  17. Million Song Dataset
  18. Penn Treebank P.O.S. Tags
  19. Princeton University’s WordNet
  20. The Sentence Corpus of Remedial English
  21. Summer Institute of Linguistics (SIL) Word List
  22. The Tanaka Corpus
  23. The General Service List
  24. The New General Service List
  25. The Academic Word List
  26. The New Academic Word List
  27. The TOEIC Word List
  28. The Business Service List
  29. Apache Open Office MyThes
  30. Global WordNet

NGSL Redirect

If you are coming from Charlie Browne’s NGSL/NAWL/TSL/BSL sites, please see my new blog post for multilingual and other additional data for these lists:

The NGSL, NAWL and TSL are now available for Quizlet in 16 different languages. Paul Raines originally had them up directly on the Quizlet site, but they needed to be taken down due to too many lists being associated with one account. Fortunately Paul has figured out a good workaround by making all files available in .csv format from his website. He’s generated definitions and part-of-speech from public domain dictionaries for the following languages: Arabic, Chinese, Dutch, English, French, German, Greek, Italian, Japanese, Korean, Portuguese, Russian, Spanish, Swedish, Thai, and Turkish. Links to the lists can be found via a blogpost on his website.

10 years in Japan

Today I mark 10 years living and working in Japan. To commemorate the occasion, here is one of my first blog posts from October 2006:


Some things about Japan that I’ve noticed:

  • The plugs don’t have switches, so if you want to turn something off, you have to physically unplug it
  • Semi-automatic doors: they lack motion sensors and only open when you press the button
  • Pelican crossings have no buttons to press
  • When it rains, everyone uses an umbrella
  • There are little racks in which to put your wet umbrella when entering shops
  • The Japanese are incredibly polite: one night some of us got lost, and when we asked for directions, we were escorted by a stranger for a good half-mile to the train station, which was the opposite direction to which he had been walking
  • The local gaijin pub, Mattari, serves fish and chips
  • The Japanese like queuing even more than the British. You might even expect to find them queuing on the platform for trains
  • There are lots of bikes
  • Pachinko parlors: buy yourself a tub full of ball bearings and pour them into an inverted pinball machine. Adopt an expression of post-lobotomy desolation. These places are completely insane.

For a more comprehensive rundown of the past decade, check out my post on TEFL Journey.

20 Tech Tips from Vocab@Tokyo 2016

  1. Tom Cobb’s venerable Lex Tutor now has a mobile interface
  2. Collins and Merriam-Webster both provide free online dictionaries
  3. The University of Texas at Austin provides a wide selection of free handouts (PDF) for teachers of English language writing
  4. Calibre is a comprehensive e-book manager and converter
  5. OmniPage and ABBYY FineReader are powerful OCR (Optical Character Recognition) applications
  6. The Lexical Research Foundation is “a not-for-profit organisation to promote excellence in lexical and vocabulary acquisition, description and pedagogy.”
  7. AntWordProfiler, Web VocabProfile, Range, and P_Lex (PDF) are tools for profiling lexical sophistication of a text, i.e. the proportion of advanced (rare) vocabulary…
  8. …while TextInspector can be used to measure lexical variation, i.e. the proportion of word types to tokens
  9. Michael Covington has developed a number of algorithms and tools for analyzing texts, including Moving Average Type-Token Ratio (MATTR)
  10. Paul Nation’s book, What You Need to Know to Learn a Foreign Language, is available as a free PDF download…
  11. …as are all his Vocabulary Size Tests (VST)…
  12. …which can also be taken online via Tom Cobb’s site
  13. Laurence Anthony’s WebSCoRE is “a free, parallel concordancer with a specially developed bilingual pedagogical corpus”
  14. Paul Meara’s Lognostics website “is designed to provide access to up to date research tools for people working in the field of Second Language Vocabulary Acquisition”
  15. Vocabulary Learning and Instruction (VLI) is an open access international journal for research relating to vocabulary acquisition, instruction, and assessment.
  16. Showbie is a great tool for keeping digital portfolios of students’ work
  17. Coh-Metrix is a system for computing computational cohesion and coherence metrics for written and spoken texts
  18. Lexile Analyzer can be used to compute the complexity of a text, including sentence length and word frequency
  19. Cambridge University Press’s English Vocabulary Profile (EVP) “offers reliable information about which words and phrases are known and used by learners at each level of the Common European Framework (CEF)”
  20. The CEFR-J website provides a series of “can-do” descriptors specifically for English language teaching contexts in Japan.
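
Tips 8 and 9 mention the proportion of word types to tokens and the Moving Average Type-Token Ratio (MATTR). For readers who want to see what these measures compute, here is a minimal Python sketch. It is not Michael Covington’s or TextInspector’s implementation: the function names, the simple regular-expression tokenizer, and the default window of 50 tokens are all assumptions made for illustration.

```python
import re


def tokenize(text):
    """Very rough tokenizer: lowercase runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def ttr(tokens):
    """Type-token ratio: number of distinct words divided by total words."""
    if not tokens:
        raise ValueError("cannot compute TTR of an empty text")
    return len(set(tokens)) / len(tokens)


def mattr(tokens, window=50):
    """Average the TTR of every consecutive window of `window` tokens.

    Texts no longer than the window fall back to the plain TTR.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    if len(tokens) <= window:
        return ttr(tokens)
    ratios = [
        ttr(tokens[start:start + window])
        for start in range(len(tokens) - window + 1)
    ]
    return sum(ratios) / len(ratios)


sample_tokens = tokenize("The cat sat on the mat and the dog sat on the rug.")
print(ttr(sample_tokens))              # 8 types over 13 tokens
print(mattr(sample_tokens, window=5))  # averaged over 5-token windows
```

Plain TTR tends to fall as a text gets longer, simply because common words repeat. Averaging over fixed-size windows is what makes MATTR more comparable across texts of different lengths.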

If you found these tips useful, why not check out the new version of my book, which has been revised, updated and expanded for 2019: 50 Ways to Teach with Technology