Rankings, definitions, pronunciations and additional data for NGSL, NAWL, TSL, and BSL

I have generated supplementary data for four word lists (NGSL, NAWL, TSL, and BSL) originally created by Dr. Charles Browne et al. The supplementary data includes:

  1. Word: the word (lemma) as it appears on the original list
  2. POS: the most common part-of-speech for the word according to the Moby Part-of-Speech database
  3. BNC Rank: the frequency ranking of the word according to the British National Corpus (lower number equals higher frequency)
  4. Google Rank: the frequency ranking of the word according to the Google Corpus (lower number equals higher frequency)
  5. IPA: the International Phonetic Alphabet transcription of the word, using data derived from the CMU Pronouncing Dictionary
  6. Conjugations: variations of the form of the word according to tense, person, etc.*
  7. Synonyms: a list of words with similar or related meanings*
  8–23. Multilingual definitions: Arabic, Chinese, German, Greek, English, French, Italian, Japanese, Korean, Dutch, Portuguese, Russian, Spanish, Swedish, Thai, and Turkish*

*Data provided by public domain dictionary/thesaurus sources, where available.
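
If you want to work with the data programmatically, the fields listed above could be read along these lines. This is only a minimal sketch: it assumes a tab-separated file with a header row that uses the field names from the list, which may not match the actual download format. The sample rows are placeholders, not real values from the lists, and `parse_rank` and `load_entries` are hypothetical helper names.

```python
import csv
import io

# Field names taken from the list above. The layout (tab-separated, header
# row with exactly these names) is an assumption made for illustration.
CORE_COLUMNS = ["Word", "POS", "BNC Rank", "Google Rank", "IPA",
                "Conjugations", "Synonyms"]

# Placeholder rows, NOT real values from the data set.
SAMPLE = (
    "\t".join(CORE_COLUMNS) + "\n"
    "alpha\tnoun\t10\t12\t\t\t\n"
    "beta\tverb\t3\t\t\t\t\n"
)


def parse_rank(text):
    """Return the rank as an int, or None when the cell is empty."""
    text = text.strip()
    return int(text) if text else None


def load_entries(handle):
    """Read tab-separated rows into dicts, converting the two rank columns."""
    entries = []
    for row in csv.DictReader(handle, delimiter="\t"):
        row["BNC Rank"] = parse_rank(row["BNC Rank"])
        row["Google Rank"] = parse_rank(row["Google Rank"])
        entries.append(row)
    return entries


entries = load_entries(io.StringIO(SAMPLE))

# A lower rank means a higher frequency, so sort ascending by BNC rank and
# push words without a rank to the end.
by_frequency = sorted(
    entries,
    key=lambda e: (e["BNC Rank"] is None, e["BNC Rank"] or 0),
)
```

To read a real file, you would pass an open file handle to `load_entries` instead of the in-memory sample, adjusting the delimiter and column names to whatever the downloaded file actually uses.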


Download the data:

This supplementary data is available under the same license as the original lists: Creative Commons Attribution-ShareAlike 4.0 International License.

5 thoughts on “Rankings, definitions, pronunciations and additional data for NGSL, NAWL, TSL, and BSL”

  1. Hi, Paul

    I want to create a set of study cards based on the NGSL with POS, BNC Rank, Google Rank, IPA, Conjugations, Synonyms, and Definitions in English and the learner’s language.

    Do you know if the information you have provided can be reproduced on a commercial basis?

    i.e. are there restrictions on use of the information beyond CC 4.0?

    I intend to use them in the school I work at, but it’s a lot of work to keep to myself, so monetising seems sensible if possible. If there are restrictions, I could remove those elements in a commercial design.

    I know you are not the rights holder of all of these data sets, but perhaps you have some insight you could share.

    I’d be very happy to collaborate on such a project, if you are interested.

    I have been looking at other corpus data, especially COCA, but they are quite restrictive regarding publication of specific rank and frequency information. They require banding of frequency into 20 bands or fewer.

    Hope to hear back.

    Thanks,

    Miguel

  2. Hi Miguel,

    Thanks for getting in touch.
    As far as I’m aware, this data cannot be used for commercial purposes, only for non-commercial educational purposes.
    Sorry about that.

    Best,

    Paul
