I have generated supplementary data for four word lists (NGSL, NAWL, TSL, and BSL) originally created by Dr. Charles Browne et al. The supplementary data includes:
- Word: the word (lemma) as it appears on the original list
- POS: the most common part of speech for the word according to the Moby Part-of-Speech database
- BNC Rank: the frequency ranking of the word according to the British National Corpus (lower number equals higher frequency)
- Google Rank: the frequency ranking of the word according to the Google Corpus (lower number equals higher frequency)
- IPA: the International Phonetic Alphabet transcription of the word, using data derived from the CMU Pronouncing Dictionary
- Conjugations: variations of the form of the word according to tense, person, etc.*
- Synonyms: a list of words with similar or related meanings*
- Multilingual definitions: Arabic, Chinese, German, Greek, English, French, Italian, Japanese, Korean, Dutch, Portuguese, Russian, Spanish, Swedish, Thai, and Turkish*
*Data provided by public domain dictionary/thesaurus sources, where available.
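As a rough illustration of how the rank columns above might be used, here is a minimal Python sketch. It assumes the data has been saved as CSV with headers matching the field names listed above (`Word`, `POS`, `BNC Rank`, `Google Rank`); the actual file format and header spellings may differ. The sample rows and rank values are invented placeholders, not real entries from the lists, and `load_words` and `most_frequent` are hypothetical helper names.

```python
import csv
import io

# Invented placeholder rows for illustration only; the real files' format,
# headers, and values are assumptions and may differ.
SAMPLE = """Word,POS,BNC Rank,Google Rank
the,Definite Article,1,1
analyze,Verb,4821,3954
be,Verb,2,15
"""


def load_words(handle):
    """Read rows from a CSV handle, converting rank columns to int (blank -> None)."""
    rows = []
    for row in csv.DictReader(handle):
        for key in ("BNC Rank", "Google Rank"):
            value = (row.get(key) or "").strip()
            row[key] = int(value) if value else None
        rows.append(row)
    return rows


def most_frequent(rows, key="BNC Rank", n=10):
    """Return the n rows with the lowest rank number, i.e. the highest frequency."""
    ranked = [r for r in rows if r[key] is not None]
    return sorted(ranked, key=lambda r: r[key])[:n]


words = load_words(io.StringIO(SAMPLE))
top = most_frequent(words, n=2)
print([r["Word"] for r in top])
```

To try this on a downloaded file, replace `io.StringIO(SAMPLE)` with an opened file, for example `open(path, newline="", encoding="utf-8")`, and adjust the header names to whatever the file actually uses.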
Download the data:
- New General Service List (NGSL)
- New Academic Word List (NAWL)
- TOEIC Service List (TSL)
- Business Service List (BSL)
This supplementary data is available under the same license as the original lists: Creative Commons Attribution-ShareAlike 4.0 International License.
For my corpus, I used material from both the older CBT (Computer Based Test) and the current iBT (Internet Based Test). I found most of the materials online for free. Some were already in plain text format, but most were