230,000 real-sounding “fake” words

The list is available under a Creative Commons license, and can be viewed and downloaded here.


The list of real-sounding “fake” words used for the new Apps 4 EFL activity “Fight the Fakes” is now available for download.

The list was generated by looping through each of the words from the SIL list and splitting them into three-letter chunks. A Markov chain process was then used to determine which of the three-letter chunks were most likely to precede or follow each other. The chunks were then recombined according to these likelihoods in order to create realistic-sounding neologisms of various lengths, e.g.

  • generotizing
  • liminativate
  • coronably
  • solarians
  • troscorifyingly

The words were double-checked against the SIL list to ensure no real words were accidentally generated.
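The process described above can be sketched in a few lines of Python. This is only a rough illustration, not the code actually used to build the list: it assumes a simple first-order chain that only looks at which chunk follows which (the original also considered which chunks precede each other), and the function names, the `<s>`/`</s>` markers and the tiny word list in the usage note are all made up for the example.

```python
import random
from collections import defaultdict

# Hypothetical markers for the start and end of a word.
START, END = "<s>", "</s>"


def chunk(word, size=3):
    """Split a word into consecutive chunks of `size` letters (the last may be shorter)."""
    return [word[i:i + size] for i in range(0, len(word), size)]


def build_chain(words, size=3):
    """Record, for every chunk, the chunks that followed it in the real words."""
    chain = defaultdict(list)
    for word in words:
        pieces = [START] + chunk(word, size) + [END]
        for current, following in zip(pieces, pieces[1:]):
            # Repeated entries make frequent transitions proportionally more likely.
            chain[current].append(following)
    return dict(chain)


def generate(chain, rng, max_chunks=6):
    """Walk the chain from START until END (or a length cap) and join the chunks."""
    pieces = []
    current = START
    while len(pieces) < max_chunks:
        current = rng.choice(chain[current])
        if current == END:
            break
        pieces.append(current)
    return "".join(pieces)


def fake_words(words, count, seed=0, max_attempts=10000):
    """Generate up to `count` candidates that do not appear in the source list."""
    real = set(words)
    chain = build_chain(words)
    rng = random.Random(seed)
    fakes = set()
    for _ in range(max_attempts):
        if len(fakes) >= count:
            break
        candidate = generate(chain, rng)
        # The final check: discard anything that is already on the real word list.
        if candidate and candidate not in real:
            fakes.add(candidate)
    return sorted(fakes)
```

Calling `fake_words(["generator", "moderate"], count=2)` can only recombine the chunks "gen", "mod", "era", "tor" and "te", so anything it returns is a new mixture of those pieces that is not on the input list. With a word list as large as SIL's, the same recombination has far more chunks to draw on.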

Fun ways to teach with the words

  • Try the new Apps 4 EFL activity Fight the Fakes, which uses the words as distractors against low-frequency items from the BNC
  • Ask your students to try and invent “definitions” for the fake words based on what they sound like, e.g. “hispanelist (n.), chat show panelist from Latin America”, “mandibilious (adj.), used to describe an animal with extraordinarily strong jaws”, “rattlesnatcher (n.), a person who goes around stealing toys from small children”
  • Use them as distractors in Yes/No vocabulary knowledge tests to ensure students don’t cheat by clicking “Yes, I know this word” for every item

Introducing the TfSL (TOEFL Service List)

The final list consists of 3,773 high-frequency TOEFL words, and can be downloaded here.


Step 1: Assemble a corpus of TOEFL materials

For my corpus, I used material from both the older CBT (Computer Based Test) and the current iBT (Internet Based Test). I found most of the materials online for free. Some were already in plain text format, but most were PDFs and required Optical Character Recognition (OCR) to convert to plain text. I used ABBYY’s FineReader Pro for Mac, but there are plenty of other options out there too. Some files were in Microsoft Word format (.doc/.docx), and Mac OS X’s batch conversion utility came in handy for these. I included model answers, listening transcripts, reading passages and multiple choice questions (prompts, distractors and answers). I tried to exclude explanations, advice and instructions from the authors and/or publishers.

Ultimately, I ended up with a corpus just shy of a million words (959,124 to be precise). In general, bigger is better when it comes to corpus research. The TOEIC Service List (TSL) utilizes a corpus of about 1.5 million words, so my TOEFL corpus seems roughly comparable to this.

Step 2: Count the number of occurrences of each word

I used some custom PHP code to process my corpus data (although Python is probably better suited to corpus analysis). I lemmatized each token where possible using Yasumasa Someya’s list of lemmas. I then cross-referenced each lemma occurrence with the NGSL, NAWL and TSL. Finally, I exported to a CSV, and ended up with 13,287 rows of data.
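For anyone who would rather do this step in Python, a minimal sketch might look like the following. It is not the PHP I actually ran: the tokenizer is deliberately crude, the `lemma_of` dictionary stands in for a lookup built from a lemma list (loading the file is left out because it depends on that file's format), and the function and column names are invented for the example.

```python
import csv
import io
import re
from collections import Counter

# Very rough tokenizer: runs of letters, optionally with one internal apostrophe.
TOKEN_PATTERN = re.compile(r"[a-z]+(?:['’][a-z]+)?")


def tokenize(text):
    """Lowercase the text and return its word tokens."""
    return TOKEN_PATTERN.findall(text.lower())


def count_lemmas(text, lemma_of):
    """Count tokens by lemma; tokens without a known lemma are counted as themselves."""
    counts = Counter()
    for token in tokenize(text):
        counts[lemma_of.get(token, token)] += 1
    return counts


def to_rows(counts, reference_lists):
    """Build one row per lemma, flagging membership of each reference word list."""
    rows = []
    for lemma, freq in counts.most_common():
        row = {"lemma": lemma, "count": freq}
        for name, words in reference_lists.items():
            row[name] = lemma in words
        rows.append(row)
    return rows


def to_csv(rows):
    """Serialize the rows to CSV text (write the result to a file as needed)."""
    if not rows:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Here `reference_lists` would be something like `{"NGSL": ngsl_words, "NAWL": nawl_words, "TSL": tsl_words}`, with each value a set of words loaded from the corresponding list.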

Step 3: Curate the final list

For my final list I removed any words which also appear on the NGSL, any contractions (e.g. “don’t”, “I’m”, “that’s”), any numbers written in word form (e.g. “two”, “million”), any vocalizations (e.g. “uh”, “oh”), any ordinals (e.g. “first”, “second”, “third”), any proper nouns (e.g. “James”, “Elizabeth”, “America”, “San Francisco”, “New York”), and any words with fewer than 5 occurrences in the corpus. Next, I ran the list through a spell checker, and excluded any unrecognized words. I also excluded any non-lexical words, to leave a list consisting only of nouns, verbs, adjectives and adverbs.
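The filters above boil down to a chain of simple checks. The sketch below shows the idea under some loud assumptions: the exclusion set (numbers, vocalizations, ordinals, proper nouns), the spell-checker dictionary and the part-of-speech lookup are all passed in as plain Python collections, because how you obtain them will depend on your own tools. None of the names come from the original PHP.

```python
# Only these word classes survive the final filter.
CONTENT_TAGS = {"noun", "verb", "adjective", "adverb"}


def curate(counts, ngsl, excluded, dictionary, pos_of, min_count=5):
    """Apply the curation filters to a {word: count} mapping and return the kept words.

    ngsl       -- set of words already on the NGSL
    excluded   -- set of numbers, vocalizations, ordinals and proper nouns (lowercased)
    dictionary -- set of words a spell checker would accept (hypothetical stand-in)
    pos_of     -- dict mapping each word to a part-of-speech label (hypothetical stand-in)
    """
    kept = []
    for word, freq in counts.items():
        if freq < min_count:
            continue  # too rare in the corpus
        if word in ngsl or word in excluded:
            continue  # already covered elsewhere, or on a manual exclusion list
        if "'" in word or "’" in word:
            continue  # contraction
        if word not in dictionary:
            continue  # spell checker does not recognize it
        if pos_of.get(word) not in CONTENT_TAGS:
            continue  # non-lexical (function) word, or unknown part of speech
        kept.append(word)
    return sorted(kept)
```

The order of the checks does not change the result, so the cheapest ones (frequency and set membership) come first.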

30 Links for English Language Data Geeks

A typical corpus linguist. Although I personally prefer blue braces.
  1. The Moby Lexicon Project
  2. BNC Baby
  3. Full BNC
  4. Project Gutenberg (Download full database)
  5. CMU Pronouncing Dictionary
  6. GNU Collaborative International Dictionary of English
  7. The Internet Dictionary Project
  8. English Wiktionary Dump
  9. Simple English Wiktionary Dump
  10. JACET 8000
  11. Minimal pairs in English RP
  12. List of homographs
  13. Homophones in English RP
  14. Google’s Official List of Bad Words
  15. Yasumasa Someya’s Lemmas List
  16. MRC Psycholinguistic Database
  17. Million Song Dataset
  18. Penn Treebank P.O.S. Tags
  19. Princeton University’s WordNet
  20. The Sentence Corpus of Remedial English
  21. Summer Institute of Linguistics (SIL) Word List
  22. The Tanaka Corpus
  23. The General Service List
  24. The New General Service List
  25. The Academic Word List
  26. The New Academic Word List
  27. The TOEIC Word List
  28. The Business Service List
  29. Apache Open Office MyThes
  30. Global WordNet

The rocky road to LMS web app integration

Part 1

I’m an amateur coder. For the past couple of years, I’ve been developing a site called Apps 4 EFL; half LMS, half Web-Based Language Learning platform. It all started when I wanted to automatically generate language learning activities directly from Wikipedia articles. I’d had a bit of experience coding as a teenager, but then went on to pursue other interests (chiefly becoming an EFL teacher). It was a steep learning curve to build up the little coding knowledge I had to the point where I could achieve my aims, but I think I did an OK job in the end (the site works, although I doubt it’s the most efficient or clean code by a long way). The main reason I was able to achieve my aims was the vast array of tutorials and example code available on the web these days, including:

…to name a few. I highly recommend these resources to anybody thinking of learning coding from scratch, or developing the skills they already have.

Part 1 TLDR: I developed some useful(?) web apps for teachers and learners of EFL.

Part 2

Originally, Apps 4 EFL had no way to track learners’ progress. It simply generated pedagogical activities learners could complete online. So I decided to implement some kind of tracking system, and this is where things got complicated.

Signing up for websites is one of the worst things about the internet. Period. They all seem to require a different set of information about you, none of which you really feel they need to know. Your birthday. Your email. Sometimes your telephone number and address. And the passwords. SO. MANY. PASSWORDS.

Now multiply that issue by 25, for the number of students you have in your class.

Now multiply again by 10 for the number of classes you teach a week.

Suddenly you have 250 accounts to register, and 250 students who risk forgetting the passwords they have created, not to mention the URLs of the site(s) themselves.

Part 2 TLDR: Teachers needed a way to track their students’ engagement with the apps, but didn’t want to register all their students for yet another website.

Part 3

This is where LMSs (Learning Management Systems) come into play. They take care of user registration so you don’t have to. LMSs such as Moodle even offer an array of different question types, some of which are amenable to EFL pedagogy. However, they are designed to accommodate a wide range of teaching and learning contexts, and therefore lack the specific tools we might want for our own unique disciplines.

But don’t think the LMS creators hadn’t thought of this issue – they had, and it wasn’t long before there were proposals for a variety of ways to link LMSs to external apps and share data between the LMS and the app.

Part 3 TLDR: LMSs can be used to manage user registration whilst facilitating access to subject specific tools.

Part 4

There are several solutions now available to (amateur) app developers to get their tools working in conjunction with existing LMSs:

  1. Passing parameters through the URL
  2. Creating an LMS plugin
  3. Creating an LTI compliant app

Passing parameters through the URL

The first of these methods is the easiest, but the most limited, as data can only go one way (from the LMS to the app). It can be achieved through Moodle, for example, by adding the “URL” resource to a course. Once added, a section called “URL Parameters” is available, which can be used to pass information about the LMS, the course, and the specific user through URL parameters to the target tool (which can be accessed by the tool with a simple $_GET statement in PHP).
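For comparison, here is roughly what reading those parameters looks like in Python rather than PHP. The parameter names `userid` and `courseid` and the example URL are made up; the real names are whatever you choose when you configure the URL resource in the LMS.

```python
from urllib.parse import parse_qs, urlsplit


def read_launch_params(url, expected=("userid", "courseid")):
    """Pull the expected query-string parameters out of the URL the LMS linked to."""
    query = parse_qs(urlsplit(url).query)
    params = {}
    for name in expected:
        values = query.get(name)
        if not values:
            raise ValueError(f"missing URL parameter: {name}")
        # parse_qs returns a list per name; take the first value.
        params[name] = values[0]
    return params


# Hypothetical link generated by the LMS for one student.
launch = read_launch_params("https://example.com/app?userid=42&courseid=7")
```

Bear in mind that anything in the URL is visible to the student and can be edited by hand, so these values identify a user for convenience but prove nothing about who is actually using the app.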

Creating an LMS plugin

The second method allows for much greater integration with the LMS, and allows data to flow both to and from the app, so scores obtained from the app can be saved directly to the LMS. However, the drawbacks are numerous. Only users of that particular LMS will be able to utilize your app, and you’ll have to develop separate versions for other LMSs (and yes – there are quite a few of them) as required. The second problem is that it’s much more difficult to develop LMS plugins than it is to develop standalone web-apps, not least because you have to understand the way the LMS itself is designed and written before you can even start developing your app. Tutorials, where available, may be out-of-date or incomplete.

Creating an LTI compliant app

Learning Tools Interoperability (or “LTI” for less of a mouthful) is a specification developed (and trademarked?!) by the IMS Global Learning Consortium. It basically provides a way for LMSs to interact with external apps, and most importantly for data to flow both ways, so user information can be provided to the app, and progress data can flow back to the LMS. It is by far the most promising of the three methods discussed here, and also the most complicated and difficult to implement, especially for amateur coders working alone.

Just take a look at the implementation guide. Go on, I dare you. It’s only 12,000 words long.

OK, well maybe we don’t need to read the whole thing to get it working. There must be some example code? Yes, there is. But it’s out of date (the PHP code pertains to version 1 of the specification, not the latest version 2), and the implementation instructions are enough to make you go cross-eyed. It is by no means facile to get an LTI compliant app working. Further complications are caused by the ever-so-slightly yet infuriatingly different specifications adopted by different LMSs (Canvas vs Moodle, for example). And this is coming from someone who has navigated his way through Apple’s needlessly complicated app provisioning process.
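To give a flavour of what is involved, here is a sketch of just one piece of a version 1 style launch: checking the signature on the form data the LMS posts to the app. Version 1 launches are signed with OAuth 1.0 using HMAC-SHA1 and a secret shared between the LMS and the tool. Everything else here is an assumption made for illustration: the function names are mine, the launch URL is assumed to have no query string of its own, and a real tool would also need to check the timestamp and nonce, look up the secret for the given consumer key, and handle the differences between LMSs mentioned above.

```python
import base64
import hashlib
import hmac
from urllib.parse import quote


def _enc(value):
    """Percent-encode a value the way OAuth 1.0 requires (only unreserved characters kept)."""
    return quote(str(value), safe="~")


def signature_base_string(method, url, params):
    """Build the string that gets signed: METHOD & encoded URL & encoded sorted parameters."""
    pairs = sorted(
        (_enc(key), _enc(value))
        for key, value in params.items()
        if key != "oauth_signature"  # the signature itself is never part of the signed data
    )
    normalized = "&".join(f"{key}={value}" for key, value in pairs)
    return "&".join([method.upper(), _enc(url), _enc(normalized)])


def sign(method, url, params, consumer_secret):
    """Compute the HMAC-SHA1 signature; the token secret is empty for a launch."""
    key = _enc(consumer_secret) + "&"
    base = signature_base_string(method, url, params)
    digest = hmac.new(key.encode("utf-8"), base.encode("utf-8"), hashlib.sha1).digest()
    return base64.b64encode(digest).decode("ascii")


def is_valid_launch(method, url, params, consumer_secret):
    """Return True if the posted oauth_signature matches the one we compute ourselves."""
    supplied = params.get("oauth_signature", "")
    expected = sign(method, url, params, consumer_secret)
    return hmac.compare_digest(supplied.encode("utf-8"), expected.encode("ascii"))
```

Even this fragment leaves out the part that makes LTI worthwhile, namely sending scores back to the LMS, which is a separate signed request in the other direction.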

Part 4 TLDR: The current ways to integrate external apps with LMSs are either too limited or overly complex to set up.

Part 5 – The final word

For educators dabbling in code (and this is something that’s going to increase as programming enters the curriculum) there needs to be a simple yet powerful way to integrate web apps with LMSs. Not every app should have to provide a complicated user management system in order to track progress – not when better solutions already exist, and when doing so only complicates teachers’ lives instead of making them easier and more productive. Something between URL parameters and LTI compliance would be fantastic, and hopefully a solution will present itself in the near future.