230,000 real-sounding “fake” words

The list is available under a Creative Commons license, and can be viewed and downloaded here.

The list of real-sounding “fake” words used for the new Apps 4 EFL activity “Fight the Fakes” is now available for download.

The list was generated by looping through each word in the SIL list and splitting it into three-letter chunks. A Markov chain process was then used to determine which three-letter chunks were most likely to precede or follow one another. The chunks were then recombined according to these likelihoods to create realistic-sounding neologisms of various lengths, e.g.

  • generotizing
  • liminativate
  • coronably
  • solarians
  • troscorifyingly

The words were double-checked against the SIL list to ensure no real words were accidentally generated.
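The generation process described above can be sketched in Python. This is a minimal illustration, not the actual script used: the chunking, chain-building, and real-word rejection steps follow the description, but the function names and parameters are my own, and the SIL list is replaced by whatever word set you supply.

```python
import random
from collections import defaultdict

def build_chain(words):
    """Map each three-letter chunk to the chunks observed to follow it."""
    chain = defaultdict(list)
    for word in words:
        chunks = [word[i:i + 3] for i in range(0, len(word) - 2, 3)]
        for a, b in zip(chunks, chunks[1:]):
            chain[a].append(b)
    return chain

def fake_word(chain, real_words, min_chunks=3, max_chunks=5, tries=100):
    """Walk the chain to build a neologism; reject any real word."""
    for _ in range(tries):
        out = [random.choice(list(chain))]
        for _ in range(random.randint(min_chunks, max_chunks) - 1):
            followers = chain.get(out[-1])
            if not followers:
                break
            out.append(random.choice(followers))
        candidate = "".join(out)
        if candidate not in real_words:
            return candidate
    return None
```

Because followers are sampled with their observed frequencies (duplicates are kept in each list), common chunk transitions are reproduced more often, which is what makes the output sound plausible.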

Fun ways to teach with the words

  • Try the new Apps 4 EFL activity Fight the Fakes, which uses the words as distractors against low-frequency items from the BNC
  • Ask your students to try and invent “definitions” for the fake words based on what they sound like, e.g. “hispanelist (n.), chat show panelist from Latin America”, “mandibilious (adj.), used to describe an animal with extraordinarily strong jaws”, “rattlesnatcher (n.), a person who goes around stealing toys from small children”
  • Use them in Yes/No vocabulary knowledge tests to ensure students don’t cheat by clicking “Yes, I know this word” for every item

Rankings, definitions, pronunciations and additional data for NGSL, NAWL, TSL, and BSL

I have generated supplementary data for four word lists (NGSL, NAWL, TSL, and BSL) originally created by Dr. Charles Browne et al. The supplementary data includes:

  1. Word: the word (lemma) as it appears on the original list
  2. POS: the most common part-of-speech for the word according to the Moby Part-of-Speech database
  3. BNC Rank: the frequency ranking of the word according to the British National Corpus (lower number equals higher frequency)
  4. Google Rank: the frequency ranking of the word according to the Google Corpus (lower number equals higher frequency)
  5. IPA: the International Phonetic Alphabet transcription of the word, using data derived from the CMU Pronouncing Dictionary
  6. Conjugations: variations of the form of the word according to tense, person, etc.*
  7. Synonyms: a list of words with similar or related meanings*
  8. – 23. Multilingual definitions: Arabic, Chinese, German, Greek, English, French, Italian, Japanese, Korean, Dutch, Portuguese, Russian, Spanish, Swedish, Thai, and Turkish*

*Data provided by public domain dictionary/thesaurus sources, where available.
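As a quick sketch of how the supplementary data might be consumed, the snippet below reads one of the CSVs into dictionaries keyed by column header. The header names follow the field list above, but the exact headers in the downloadable files are an assumption on my part; adjust them to match the actual files.

```python
import csv

def load_supplementary(path):
    """Read a supplementary-data CSV into a list of dicts, coercing the
    rank columns to integers. Header names ('Word', 'BNC Rank', ...) are
    assumed from the field list above; check them against the real files."""
    with open(path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    for row in rows:
        for key in ("BNC Rank", "Google Rank"):
            if row.get(key, "").isdigit():
                row[key] = int(row[key])
    return rows
```

Once loaded, the rows can be sorted by BNC rank or filtered by part of speech as needed.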

Download the data:

This supplementary data is available under the same license as the original lists: Creative Commons Attribution-ShareAlike 4.0 International License.

Introducing the TfSL (TOEFL Service List)

The final list consists of 3,773 high-frequency TOEFL words, and can be downloaded here.

Step 1: Assemble a corpus of TOEFL materials

For my corpus, I used material from both the older CBT (Computer Based Test) and the current iBT (Internet Based Test). I found most of the materials online for free. Some were already in plain text format, but most were PDFs and required Optical Character Recognition (OCR) to convert to plain text. I used ABBYY’s FineReader Pro for Mac, but there are plenty of other options out there too. Some files were in Microsoft Word format (.doc/.docx), and Mac OS X’s batch conversion utility came in handy for these. I included model answers, listening transcripts, reading passages and multiple choice questions (prompts, distractors and answers). I tried to exclude explanations, advice and instructions from the authors and/or publishers.

Ultimately, I ended up with a corpus just shy of a million words (959,124 to be precise). In general, bigger is better when it comes to corpus research. The TOEIC Service List (TSL) utilizes a corpus of about 1.5 million words, so my TOEFL corpus is roughly comparable.

Step 2: Count the number of occurrences of each word

I used some custom PHP code to process my corpus data (although Python is probably better suited to corpus analysis). I lemmatized each token where possible using Yasumasa Someya’s list of lemmas. I then cross-referenced each lemma occurrence with the NGSL, NAWL and TSL. Finally, I exported to a CSV and ended up with 13,287 rows of data.
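Since the post notes Python is well suited to this kind of processing, here is a rough Python equivalent of the counting step. The lemma table format shown (a dict mapping each lemma to its variant forms) is a simplification of Someya’s actual file format, and the tokenizer is deliberately naive.

```python
import re
from collections import Counter

def build_lemma_map(lemma_entries):
    """Invert a {lemma: [variants]} table so every inflected form maps
    back to its lemma. The dict input format is an assumption for brevity;
    Someya's list is distributed as plain-text lines."""
    form_to_lemma = {}
    for lemma, variants in lemma_entries.items():
        form_to_lemma[lemma] = lemma
        for v in variants:
            form_to_lemma[v] = lemma
    return form_to_lemma

def count_lemmas(text, form_to_lemma):
    """Tokenize, lemmatize where possible, and count occurrences.
    Unknown tokens are counted under their surface form."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(form_to_lemma.get(t, t) for t in tokens)
```

The resulting counter can then be cross-referenced against the NGSL, NAWL and TSL word sets and written out as CSV rows.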

Step 3: Curate the final list

For my final list I removed any words which also appear on the NGSL, any contractions (e.g. “don’t”, “I’m”, “that’s”), any numbers written in word form (e.g. “two”, “million”), any vocalizations (e.g. “uh”, “oh”), any ordinals (e.g. “first”, “second”, “third”), any proper nouns (e.g. “James”, “Elizabeth”, “America”, “San Francisco”, “New York”), and any words with fewer than 5 occurrences in the corpus. Next, I ran the list through a spell checker and excluded any unrecognized words. I also excluded any non-lexical words, leaving a list consisting only of nouns, verbs, adjectives and adverbs.
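The curation rules above can be sketched as a single filter pass. Everything here other than the rules themselves is an assumption: the NGSL set, proper-noun list, spell-check dictionary, and POS lookup are stand-ins for whatever resources were actually used, and the stop-word sets are deliberately truncated.

```python
def curate(counts, ngsl, proper_nouns, dictionary, pos_of, min_count=5):
    """Filter lemma counts per the curation rules: drop NGSL overlaps,
    contractions, number words, vocalizations, ordinals, proper nouns,
    rare words, spelling errors, and non-lexical words."""
    # Deliberately partial stop sets: number words, vocalizations, ordinals.
    stop = {"two", "million", "uh", "oh", "first", "second", "third"}
    lexical = {"noun", "verb", "adjective", "adverb"}
    kept = []
    for word, count in counts.items():
        if count < min_count or word in ngsl or word in stop:
            continue
        if "'" in word or word in proper_nouns or word not in dictionary:
            continue
        if pos_of.get(word) in lexical:  # keep lexical words only
            kept.append(word)
    return sorted(kept)
```

A real pass would replace the stop sets with complete lists (and the dictionary lookup with an actual spell checker), but the order and intent of the checks mirror the steps described above.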