Rankings, definitions, pronunciations and additional data for NGSL, NAWL, TSL, and BSL

I have generated supplementary data for four word lists (NGSL, NAWL, TSL, and BSL) originally created by Dr. Charles Browne et al. The supplementary data includes:

  1. Word: the word (lemma) as it appears on the original list
  2. POS: the most common part-of-speech for the word according to the Moby Part-of-Speech database
  3. BNC Rank: the frequency ranking of the word according to the British National Corpus (lower number equals higher frequency)
  4. Google Rank: the frequency ranking of the word according to the Google Corpus (lower number equals higher frequency)
  5. IPA: the International Phonetic Alphabet transcription of the word, using data derived from the CMU Pronouncing Dictionary
  6. Conjugations: variations of the form of the word according to tense, person, etc.*
  7. Synonyms: a list of words with similar or related meanings*
  8. – 23. Multilingual definitions: Arabic, Chinese, German, Greek, English, French, Italian, Japanese, Korean, Dutch, Portuguese, Russian, Spanish, Swedish, Thai, and Turkish*

*Data provided by public domain dictionary/thesaurus sources, where available.

Download the data:

(Click the name of the list you require to open a read-only Google Spreadsheet. From the Google Spreadsheet, click “File” => “Download as” then choose your required format)

This supplementary data is available under the same license as the original lists: Creative Commons Attribution-ShareAlike 4.0 International License.

Introducing the TfSL (TOEFL Service List)

The final list consists of 3,773 high-frequency TOEFL words and can be downloaded here.

Step 1: Assemble a corpus of TOEFL materials

For my corpus, I used material from both the older CBT (Computer Based Test) and the current iBT (Internet Based Test). I found most of the materials online for free. Some were already in plain text format, but most were PDFs and required Optical Character Recognition (OCR) to convert to plain text. I used ABBYY’s FineReader Pro for Mac, but there are plenty of other options out there too. Some files were in Microsoft Word format (.doc/.docx), and Mac OS X’s batch conversion utility came in handy for these. I included model answers, listening transcripts, reading passages and multiple-choice questions (prompts, distractors and answers). I tried to exclude explanations, advice and instructions from the authors and/or publishers.

Ultimately, I ended up with a corpus just shy of a million words (959,124 to be precise). In general, bigger is better when it comes to corpus research. The TOEIC Service List (TSL) utilizes a corpus of about 1.5 million words, so my TOEFL corpus seems roughly comparable to this.

Step 2: Count the number of occurrences of each word

I used some custom PHP code to process my corpus data (although Python is probably better suited to corpus analysis). I lemmatized each token where possible using Yasumasa Someya’s list of lemmas. I then cross-referenced each lemma occurrence with the NGSL, NAWL and TSL. Finally, I exported the results to a CSV file, and ended up with 13,287 rows of data.
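
A minimal Python sketch of this step (an illustration, not the original PHP code) might look like the following. The tiny LEMMAS table and the REFERENCE_LISTS sets are hypothetical stand-ins for the full lemma list and for the NGSL, NAWL and TSL, and the tokenizer is a deliberately simple assumption.

```python
import csv
import io
import re
from collections import Counter

# Hypothetical miniature lemma table (inflected form -> lemma).
# A real run would load a full lemma list instead.
LEMMAS = {
    "studies": "study",
    "studied": "study",
    "lectures": "lecture",
    "is": "be",
    "was": "be",
}

# Hypothetical stand-ins for the three reference word lists.
REFERENCE_LISTS = {
    "NGSL": {"be", "study", "the"},
    "NAWL": {"lecture"},
    "TSL": set(),
}


def tokenize(text):
    """Lowercase the text and extract alphabetic tokens, keeping apostrophes."""
    return re.findall(r"[a-z]+(?:['’][a-z]+)*", text.lower())


def count_lemmas(text):
    """Replace each token with its lemma where one is known, then count."""
    return Counter(LEMMAS.get(token, token) for token in tokenize(text))


def to_csv(counts):
    """Return CSV text: lemma, count, and a 0/1 flag per reference list."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["lemma", "count", *REFERENCE_LISTS])
    for lemma, n in counts.most_common():
        flags = [int(lemma in words) for words in REFERENCE_LISTS.values()]
        writer.writerow([lemma, n, *flags])
    return out.getvalue()


sample = "She studies. He studied the lectures; the lecture is long."
print(to_csv(count_lemmas(sample)))
```

Tokens with no entry in the lemma table are counted under their own surface form, which mirrors the "where possible" caveat above.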

Step 3: Curate the final list

For my final list, I removed any words which also appear on the NGSL, any contractions (e.g. “don’t”, “I’m”, “that’s”), any numbers written in word form (e.g. “two”, “million”), any vocalizations (e.g. “uh”, “oh”), any ordinals (e.g. “first”, “second”, “third”), any proper nouns (e.g. “James”, “Elizabeth”, “America”, “San Francisco”, “New York”), and any words with fewer than 5 occurrences in the corpus. Next, I ran the list through a spell checker, and excluded any unrecognized words. I also excluded any non-lexical words, to leave a list consisting only of nouns, verbs, adjectives and adverbs.
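
The filters above can be sketched in Python as a single pass over the lemma counts. This is only an illustration under stated assumptions: the function name curate, the excluded set (number words, vocalizations, ordinals, proper nouns) and the dictionary mapping (word to part of speech, standing in for both the spell checker and the part-of-speech check) are all hypothetical.

```python
LEXICAL_POS = {"noun", "verb", "adjective", "adverb"}


def curate(counts, ngsl, excluded, dictionary, min_count=5):
    """Apply the curation filters and return words sorted by frequency.

    counts:     word -> number of occurrences in the corpus
    ngsl:       set of words already on the NGSL
    excluded:   set of number words, vocalizations, ordinals, proper nouns
    dictionary: word -> part of speech (hypothetical spell-check stand-in)
    """
    kept = []
    for word, n in counts.items():
        if n < min_count:
            continue  # too rare in the corpus
        if word in ngsl or word in excluded:
            continue  # already covered, or explicitly excluded
        if "'" in word or "’" in word:
            continue  # contraction
        if dictionary.get(word) not in LEXICAL_POS:
            continue  # unrecognized word or non-lexical part of speech
        kept.append(word)
    # Most frequent first; ties broken alphabetically.
    return sorted(kept, key=lambda w: (-counts[w], w))


counts = {
    "photosynthesis": 12, "the": 500, "don't": 40, "two": 30, "uh": 25,
    "james": 9, "sediment": 4, "xqzt": 7, "whereas": 15, "erode": 6,
}
dictionary = {
    "photosynthesis": "noun", "sediment": "noun", "erode": "verb",
    "whereas": "conjunction", "the": "determiner",
}
print(curate(counts, {"the"}, {"two", "uh", "james"}, dictionary))
```

Because the counts here are keyed by lowercased words, proper nouns cannot be detected from capitalization and have to be supplied in the exclusion set.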