The video for my recent presentation at the JALT International Conference is now available! Error Spotter is a new web-app for improving students’ recognition of English grammatical errors.
Each zip file contains a separate directory for each of the following languages:
- Arabic (NGSL/TSL only)
- Greek (NGSL only)
- Korean (NGSL only)
Each directory contains a CSV spreadsheet for each word of the corresponding word list.
Each CSV spreadsheet contains two columns: the first contains the word in context (KWIC), and the second a translation of the sentence.
Where possible, all conjugations of the word are included (e.g. accuse, accused, accusing, etc.).
1. Example sentences are not available for every word. Some languages/word lists have more example sentences than others.
2. Sentences were run through a rudimentary “profanity filter” in an attempt to remove inappropriate content. However, as all sentences come from a crowd-sourced database, care should be taken when using these sentences for pedagogical purposes. The quality of the translation also varies for the same reason.
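If you want to work with the data programmatically, here is a minimal PHP sketch of reading one of these two-column files. The filename accuse.csv is purely hypothetical.

```php
<?php
// Minimal sketch: read one of the two-column KWIC/translation CSVs.
// "accuse.csv" is a hypothetical filename used purely for illustration.
$handle = fopen('accuse.csv', 'r');
if ($handle === false) {
    die("Could not open the CSV file\n");
}
while (($row = fgetcsv($handle)) !== false) {
    if (count($row) < 2) {
        continue; // skip any malformed rows
    }
    list($kwic, $translation) = $row;
    echo $kwic . "\t" . $translation . "\n";
}
fclose($handle);
```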
- The Moby Lexicon Project
- BNC Baby
- Full BNC
- Project Gutenberg (Download full database)
- CMU Pronouncing Dictionary
- GNU Collaborative International Dictionary of English
- The Internet Dictionary Project
- English Wiktionary Dump
- Simple English Wiktionary Dump
- JACET 8000
- Minimal pairs in English RP
- List of homographs
- Homophones in English RP
- Google’s Official List of Bad Words
- Yasumasa Someya’s Lemmas List
- MRC Psycholinguistic Database
- Million Song Dataset
- Penn Treebank P.O.S. Tags
- Princeton University’s WordNet
- The Sentence Corpus of Remedial English
- Summer Institute of Linguistics (SIL) Word List
- The Tanaka Corpus
- The General Service List
- The New General Service List
- The Academic Word List
- The New Academic Word List
- The TOEIC Word List
- The Business Service List
- Apache OpenOffice MyThes
- Global WordNet
Download the CSV of all 2,313 terms (inc. Japanese definitions) or access the full list on Quizlet.
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
Step 1: Assemble a corpus of TOEFL past papers
For my corpus, I used material from both the older CBT (Computer Based Test) and the current iBT (Internet Based Test). I found most of the materials online for free. Some were already in plain text format, but most were PDFs and required Optical Character Recognition (OCR) to convert to plain text. I used ABBYY’s FineReader Pro for Mac, but there are plenty of other options out there too. Some files were in Microsoft Word format (.doc/.docx), and Mac OS X’s batch conversion utility came in handy for these. I included model answers, listening transcripts, reading passages and multiple choice questions (prompts, distractors and answers). I tried to exclude explanations, advice and instructions from the authors and/or publishers.
Ultimately, I ended up with a corpus just shy of a million words (959,124 to be precise). In general, bigger is better when it comes to corpus research. The TOEIC Service List (TSL) utilizes a corpus of about 1.5 million words, so my TOEFL corpus seems roughly comparable in size.
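As a rough illustration of the final assembly step, the sketch below merges a folder of converted plain-text files into a single corpus file and reports an approximate word count. The corpus/ directory and toefl_corpus.txt filename are assumptions made for the sake of the example, not the actual files I used.

```php
<?php
// Sketch: merge converted plain-text files into one corpus file and
// report a rough word count. "corpus/" and "toefl_corpus.txt" are
// assumed names used only for illustration.
$corpus = '';
foreach (glob('corpus/*.txt') as $file) {
    $corpus .= file_get_contents($file) . "\n";
}
file_put_contents('toefl_corpus.txt', $corpus);

// str_word_count gives only an approximation; exact token counts
// depend on how you define a "word".
echo "Approximate word count: " . str_word_count($corpus) . "\n";
```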
Step 2: Count the number of occurrences of each word
I used some custom PHP code to process my corpus data (although Python is probably better suited to corpus analysis). I lemmatized each token where possible using Yasumasa Someya’s list of lemmas. I then cross-referenced each lemma occurrence with the NGSL, NAWL and TSL. Finally, I exported to a CSV and ended up with 13,287 rows of data.
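The actual script isn’t published, but the general approach looks something like the sketch below. The file names and formats are simplified assumptions (e_lemma.txt with "lemma -> form1,form2,…" lines, and the three word lists as one headword per line), not necessarily exactly what I used.

```php
<?php
// Sketch of the counting step. File names and formats are simplified
// assumptions: e_lemma.txt with lines like "accuse -> accused,accuses,accusing",
// and ngsl.txt / nawl.txt / tsl.txt with one headword per line.

// 1. Build a map from each inflected form to its lemma.
$formToLemma = [];
foreach (file('e_lemma.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) as $line) {
    if (strpos($line, '->') === false) {
        continue;
    }
    list($lemma, $forms) = array_map('trim', explode('->', $line, 2));
    $formToLemma[$lemma] = $lemma;
    foreach (explode(',', $forms) as $form) {
        $formToLemma[trim($form)] = $lemma;
    }
}

// 2. Load the reference word lists as lookup sets.
$ngsl = array_flip(array_map('trim', file('ngsl.txt')));
$nawl = array_flip(array_map('trim', file('nawl.txt')));
$tsl  = array_flip(array_map('trim', file('tsl.txt')));

// 3. Tokenize the corpus and count lemma frequencies.
$text   = strtolower(file_get_contents('toefl_corpus.txt'));
$tokens = preg_split("/[^a-z']+/", $text, -1, PREG_SPLIT_NO_EMPTY);
$counts = [];
foreach ($tokens as $token) {
    $lemma = isset($formToLemma[$token]) ? $formToLemma[$token] : $token;
    $counts[$lemma] = isset($counts[$lemma]) ? $counts[$lemma] + 1 : 1;
}
arsort($counts);

// 4. Export one row per lemma, noting which lists it appears on.
$out = fopen('toefl_frequencies.csv', 'w');
fputcsv($out, ['lemma', 'frequency', 'on_ngsl', 'on_nawl', 'on_tsl']);
foreach ($counts as $lemma => $freq) {
    fputcsv($out, [
        $lemma,
        $freq,
        isset($ngsl[$lemma]) ? 1 : 0,
        isset($nawl[$lemma]) ? 1 : 0,
        isset($tsl[$lemma]) ? 1 : 0,
    ]);
}
fclose($out);
```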
Step 3: Curate the final list
For my final list I removed any words which also appear on the NGSL, any contractions (e.g. “don’t”, “I’m”, “that’s”), any numbers written in word form (e.g. “two”, “million”), any vocalizations (e.g. “uh”, “oh”), any ordinals (e.g. “first”, “second”, “third”), any proper nouns (e.g. “James”, “Elizabeth”, “America”, “San Francisco”, “New York”), and any words with fewer than 5 occurrences in the corpus. Next, I ran the list through a spell checker and excluded any unrecognized words. I also excluded any non-lexical words, to leave a list consisting only of nouns, verbs, adjectives and adverbs.
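Again as a sketch only: the filters above boil down to something like the following, where $counts and $ngsl are assumed to come from the counting step, and the small exclusion arrays are illustrative samples rather than the complete lists.

```php
<?php
// Sketch of the main curation filters. $counts and $ngsl are assumed to
// come from the counting step above; these exclusion arrays are small
// illustrative samples, not the full lists actually used.
$contractions  = ["don't", "i'm", "that's"];
$numberWords   = ['two', 'million'];
$vocalizations = ['uh', 'oh'];
$ordinals      = ['first', 'second', 'third'];
$exclude = array_flip(array_merge($contractions, $numberWords, $vocalizations, $ordinals));

$final = [];
foreach ($counts as $lemma => $freq) {
    if ($freq < 5) {
        continue; // fewer than 5 occurrences in the corpus
    }
    if (isset($ngsl[$lemma])) {
        continue; // already covered by the NGSL
    }
    if (isset($exclude[$lemma])) {
        continue; // contraction, number word, vocalization or ordinal
    }
    $final[$lemma] = $freq;
}
// Proper nouns, unrecognized spellings and non-lexical words were removed
// in further passes (spell checker and part-of-speech filtering).
```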
Step 4. Generate flashcards
I now had a list of 2,313 terms, made up of 523 adjectives, 123 adverbs, 1,366 nouns, and 301 verbs. I used Text to Flash to generate Japanese definitions for each word, then uploaded the words to Quizlet, separated by part of speech and ordered alphabetically.
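The upload files themselves are easy to produce once the definitions are in hand. Here is a hypothetical sketch: $terms stands in for the word/definition data coming out of Text to Flash, keyed by word with a part of speech and a Japanese definition for each entry.

```php
<?php
// Sketch: write one alphabetically ordered CSV per part of speech,
// ready for Quizlet's import function. $terms is a hypothetical
// structure of word => ['pos' => ..., 'definition' => ...] entries.
$byPos = [];
foreach ($terms as $word => $info) {
    $byPos[$info['pos']][$word] = $info['definition'];
}
foreach ($byPos as $pos => $entries) {
    ksort($entries); // alphabetical order within each part of speech
    $out = fopen("toefl_{$pos}.csv", 'w');
    foreach ($entries as $word => $definition) {
        fputcsv($out, [$word, $definition]);
    }
    fclose($out);
}
```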
Feb. 2017 update
Unfortunately, after uploading all the flashcard sets to Quizlet, my account started to run so slowly that it became unusable. I had to remove the majority of the data from Quizlet, but I am now offering the data to download in CSV format. If required, users can upload the flashcards to their own Quizlet accounts using the import function.
Links to the CSVs are as follows:
Each download (.zip) includes translations for: Arabic, Chinese, Dutch, English, French, German, Greek, Italian, Japanese, Korean, Portuguese, Russian, Spanish, Swedish, Thai, and Turkish. Translations were automatically generated from public domain dictionary sources.
I’ve generated multilingual, part-of-speech categorized, difficulty-sorted sets of flashcards for the latest New General Service List (NGSL), New Academic Word List (NAWL) and TOEIC Service List (TSL), and added them to Quizlet.
The sets are organized in classes according to the definition language. Each class contains sets of flashcards for the four lexical parts of speech (adverbs, verbs, adjectives and nouns).
Each set contains a maximum of 20 flashcards, and the sets are ordered by difficulty (i.e. frequency), with Part 1 of each list containing the easiest (most common) words.
As the word lists themselves contain no part-of-speech information, I tagged the words using Moby, and selected only the most common part of speech for words which can be used as multiple parts of speech. The word “register”, for example, is listed as a noun by Moby before it is listed as a verb, so only the noun definition of “register” was included in the flashcards.
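In code terms, picking the primary part of speech is simple enough. The sketch below assumes the Moby part-of-speech file uses one word per line, a single delimiter character, and single-letter codes listed with the most common usage first; the filename and delimiter are assumptions, so check your own copy of the data.

```php
<?php
// Sketch: pick each word's most common part of speech from Moby.
// Assumptions: one "word<delimiter>codes" entry per line, with the
// single-letter codes ordered from most to least common usage.
// The filename and delimiter character may differ in your copy.
$posNames = [
    'N' => 'noun', 'V' => 'verb', 't' => 'verb', 'i' => 'verb',
    'A' => 'adjective', 'v' => 'adverb',
];
$delimiter = "\xD7"; // assumed delimiter

$primaryPos = [];
foreach (file('mobypos.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) as $line) {
    $parts = explode($delimiter, $line, 2);
    if (count($parts) < 2) {
        continue;
    }
    list($word, $codes) = $parts;
    $first = substr($codes, 0, 1); // first code = most common part of speech
    if (isset($posNames[$first])) {
        $primaryPos[strtolower($word)] = $posNames[$first];
    }
}
// e.g. "register" resolves to "noun" here, so only its noun definition
// was kept for the flashcards.
```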
I’m an amateur coder. For the past couple of years, I’ve been developing a site called Apps 4 EFL; half LMS, half Web-Based Language Learning platform. It all started when I wanted to automatically generate language learning activities directly from Wikipedia articles. I’d had a bit of experience coding as a teenager, but then went on to pursue other interests (chiefly becoming an EFL teacher). It was a steep learning curve to develop the coding knowledge I already had far enough to achieve my aims, but I think I did an OK job in the end (the site works, although I doubt it’s the most efficient or clean code by a long way). The main reason I was able to achieve my aims was the vast array of tutorials and example code available on the web these days, including:
…to name a few. I highly recommend these resources to anybody thinking of learning coding from scratch, or developing the skills they already have.
Part 1 TLDR: I developed some useful(?) web apps for teachers and learners of EFL.
Originally, Apps 4 EFL had no way to track learners’ progress. It simply generated pedagogical activities learners could complete online. So I decided to implement some kind of tracking system, and this is where things got complicated.
Signing up for websites is one of the worst things about the internet. Period. They all seem to require a different set of information about you, none of which you really feel they need to know. Your birthday. Your email. Sometimes your telephone number and address. And the passwords. SO. MANY. PASSWORDS.
Now multiply that issue by 25, for the number of students you have in your class.
Now multiply again by 10 for the number of classes you teach a week.
Suddenly you have 250 accounts to register, and 250 students who risk forgetting the passwords they have created, not to mention the URLs of the site(s) themselves.
Part 2 TLDR: Teachers needed a way to track their users’ engagement with the apps, but didn’t want to register all their students for yet another website.
This is where LMSs (Learning Management Systems) come into play. They take care of user registration so you don’t have to. LMSs such as Moodle even offer an array of different question types, some of which are amenable to EFL pedagogy. However, they are designed to accommodate a wide range of teaching and learning contexts, and therefore lack the specific tools we might want for our own unique disciplines.
But don’t think the LMS creators hadn’t thought of this issue – they had, and it wasn’t long before there were proposals for a variety of ways to link LMSs to external apps and share data between the LMS and the app.
Part 3 TLDR: LMSs can be used to manage user registration whilst facilitating access to subject specific tools.
There are several solutions now available to (amateur) app developers to get their tools working in conjunction with existing LMSs:
- Passing parameters through the URL
- Creating an LMS plugin
- Creating an LTI compliant app
Passing parameters through the URL
The first of these methods is the easiest, but the most limited, as data can only go one way (from the LMS to the app). It can be achieved in Moodle, for example, by adding the “URL” resource to a course. Once added, a section called “URL Parameters” is available, which can be used to pass information about the LMS, the course, and the specific user through URL parameters to the target tool (which the tool can then read with a simple lookup of PHP’s $_GET superglobal).
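For instance, if the tool URL ends up looking something like https://example.com/tool.php?userid=123&courseid=456 (the parameter names are whatever you configure on the Moodle side, so treat these as placeholders), the receiving end can be as simple as:

```php
<?php
// Read the parameters the LMS appended to the tool URL. The parameter
// names ("userid", "courseid") are placeholders; use whatever was
// configured in the URL resource settings.
$userId   = isset($_GET['userid'])   ? $_GET['userid']   : null;
$courseId = isset($_GET['courseid']) ? $_GET['courseid'] : null;

if ($userId === null) {
    die('No user information was received from the LMS.');
}
echo 'Launched for user ' . htmlspecialchars($userId)
   . ' in course ' . htmlspecialchars($courseId !== null ? $courseId : 'unknown');
```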
Creating an LMS plugin
The second method allows for much greater integration with the LMS, and allows data to flow both to and from the app, so scores obtained from the app can be saved directly to the LMS. However, the drawbacks are numerous. The first is that only users of that particular LMS will be able to utilize your app, and you’ll have to develop separate versions for other LMSs (and yes – there are quite a few of them) as required. The second is that it’s much more difficult to develop LMS plugins than it is to develop standalone web-apps, not least because you have to understand the way the LMS itself is designed and written before you can even start developing your app. Tutorials, where available, may be out-of-date or incomplete.
Creating an LTI compliant app
Learning Tools Interoperability (or “LTI” for less of a mouthful) is a specification developed (and trademarked?!) by the IMS Global Learning Consortium. It basically provides a way for LMSs to interact with external apps, and most importantly for data to flow both ways, so user information can be provided to the app, and progress data can flow back to the LMS. It is by far the most promising of the three methods discussed here, but also the most complicated and difficult to implement, especially for amateur coders working alone.
Just take a look at the implementation guide. Go on, I dare you. It’s only 12,000 words long.
OK, well maybe we don’t need to read the whole thing to get it working. There must be some example code? Yes, there is. But it’s out of date (the PHP code pertains to version 1 of the specification, not the latest version 2), and the implementation instructions are enough to make you go cross-eyed. It is by no means facile to get an LTI compliant app working. Further complications are caused by the ever-so-slightly yet infuriatingly different specifications adopted by different LMSs (Canvas vs Moodle, for example). And this is coming from someone who has navigated his way through Apple’s needlessly complicated app provisioning process.
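To give a flavour of what’s involved, here is a heavily simplified sketch of verifying an LTI 1.x launch, which arrives as an OAuth 1.0a-signed POST. A real implementation also has to check the timestamp and nonce, cope with the LMS-specific quirks mentioned above, and obviously not hard-code the shared secret; the URL and secret below are placeholders.

```php
<?php
// Heavily simplified sketch of verifying an LTI 1.x launch request
// (an OAuth 1.0a-signed POST). Production code must also validate
// oauth_timestamp/oauth_nonce and handle errors properly.
$consumerSecret = 'shared-secret-agreed-with-the-lms';   // placeholder
$launchUrl      = 'https://example.com/lti/launch.php';  // this tool's own launch URL

$params = $_POST;
$providedSignature = isset($params['oauth_signature']) ? $params['oauth_signature'] : '';
unset($params['oauth_signature']);

// Build the OAuth 1.0 signature base string: METHOD&url&sorted-params.
ksort($params);
$pairs = [];
foreach ($params as $key => $value) {
    $pairs[] = rawurlencode($key) . '=' . rawurlencode($value);
}
$baseString = 'POST&' . rawurlencode($launchUrl) . '&' . rawurlencode(implode('&', $pairs));

// Sign with HMAC-SHA1 using "consumerSecret&" (no token secret in an LTI launch).
$expectedSignature = base64_encode(
    hash_hmac('sha1', $baseString, rawurlencode($consumerSecret) . '&', true)
);

if (!hash_equals($expectedSignature, $providedSignature)) {
    http_response_code(401);
    die('Invalid LTI launch signature.');
}

// The launch is genuine: user details arrive in the same POST,
// e.g. $_POST['user_id'], $_POST['roles'], $_POST['context_id'].
echo 'LTI launch verified for user ' . htmlspecialchars($_POST['user_id']);
```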
Part 4 TLDR: The current ways to integrate external apps with LMSs are either too limited or overly complex to set up.
Part 5 – The final word
For educators dabbling in code (and this is something that’s going to increase as programming enters the curriculum) there needs to be a simple yet powerful way to integrate web apps with LMSs. Not every app should have to provide a complicated user management system in order to track progress – not when better solutions already exist, and when doing so only complicates teachers’ lives instead of making them easier and more productive. Something between URL parameters and LTI compliance would be fantastic, and hopefully a solution will present itself in the near future.