Introducing the TfSL (TOEFL Service List)

The final list consists of 3,773 high-frequency TOEFL words, and can be downloaded here.


Step 1: Assemble a corpus of TOEFL materials

For my corpus, I used material from both the older CBT (Computer Based Test) and the current iBT (Internet Based Test). I found most of the materials online for free. Some were already in plain text format, but most were PDFs and required Optical Character Recognition (OCR) to convert to plain text. I used ABBYY’s FineReader Pro for Mac, but there are plenty of other options out there too. Some files were in Microsoft Word format (.doc/.docx), and Mac OS X’s batch conversion utility came in handy for these. I included model answers, listening transcripts, reading passages and multiple-choice questions (prompts, distractors and answers). I tried to exclude explanations, advice and instructions from the authors and/or publishers.

Ultimately, I ended up with a corpus just shy of a million words (959,124 to be precise). In general, bigger is better when it comes to corpus research. The TOEIC Service List (TSL) uses a corpus of about 1.5 million words, so my TOEFL corpus seems roughly comparable in size.

Step 2: Count the number of occurrences of each word

I used some custom PHP code to process my corpus data (although Python is probably better suited to corpus analysis). I lemmatized each token where possible using Yasumasa Someya’s list of lemmas. I then cross-referenced each lemma occurrence with the NGSL (New General Service List), the NAWL (New Academic Word List) and the TSL. Finally, I exported the results to a CSV file, and ended up with 13,287 rows of data.
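To make this step concrete, here is a minimal Python sketch of the same idea. It is not the PHP code I actually used: the tiny LEMMAS table and the REFERENCE_LISTS sets are hypothetical stand-ins for Someya’s lemma list and the real NGSL, NAWL and TSL, and the tokenizer is a deliberately simple assumption.

```python
import csv
import re
from collections import Counter

# Hypothetical miniature lemma table (inflected form -> lemma).
# A real run would load a full lemma list instead.
LEMMAS = {"studies": "study", "studied": "study", "is": "be", "was": "be"}

# Hypothetical stand-ins for the reference word lists.
REFERENCE_LISTS = {
    "NGSL": {"be", "study", "the"},
    "NAWL": {"hypothesis"},
    "TSL": {"invoice"},
}

# Assumption: a token is a run of letters, optionally with one
# straight apostrophe inside it (e.g. "don't").
TOKEN_RE = re.compile(r"[a-z]+(?:'[a-z]+)?")


def count_lemmas(text, lemmas=LEMMAS):
    """Lowercase and tokenize the text, map each token to its lemma
    where one is known, and count the occurrences of each lemma."""
    tokens = TOKEN_RE.findall(text.lower())
    return Counter(lemmas.get(token, token) for token in tokens)


def cross_reference(counts, lists=REFERENCE_LISTS):
    """Build one row per lemma, flagging membership of each list."""
    rows = []
    for lemma, freq in counts.most_common():
        row = {"lemma": lemma, "count": freq}
        for name, words in lists.items():
            row[name] = lemma in words
        rows.append(row)
    return rows


def write_csv(rows, handle):
    """Write the rows to an open text file handle as CSV."""
    writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
```

Calling cross_reference(count_lemmas(text)) on the whole corpus text and passing the result to write_csv gives one row per lemma, which is roughly the shape of spreadsheet I worked from in the next step.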

Step 3: Curate the final list

For my final list, I removed any words which also appear on the NGSL, any contractions (e.g. “don’t”, “I’m”, “that’s”), any numbers written in word form (e.g. “two”, “million”), any vocalizations (e.g. “uh”, “oh”), any ordinals (e.g. “first”, “second”, “third”), any proper nouns (e.g. “James”, “Elizabeth”, “America”, “San Francisco”, “New York”), and any words with fewer than 5 occurrences in the corpus. Next, I ran the list through a spell checker, and excluded any unrecognized words. I also excluded any non-lexical words, to leave a list consisting only of nouns, verbs, adjectives and adverbs.
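The filters above can be sketched in Python as a single pass over the frequency counts. This is only an illustration under stated assumptions, not my actual process: the stoplist (numbers, vocalizations, ordinals, proper nouns), the spell-check dictionary and the part-of-speech lookup are all hypothetical hand-built inputs, whereas in practice much of this was manual curation.

```python
MIN_COUNT = 5  # words with fewer than 5 occurrences are dropped
LEXICAL_POS = {"noun", "verb", "adjective", "adverb"}


def curate(counts, ngsl, stoplist, dictionary, pos_tags):
    """Apply the exclusion rules and return (lemma, count) pairs,
    most frequent first. All word lists are supplied by the caller."""
    kept = []
    for lemma, freq in counts.items():
        if freq < MIN_COUNT:
            continue  # too rare in the corpus
        if lemma in ngsl:
            continue  # already covered by the NGSL
        if "'" in lemma or "\u2019" in lemma:
            continue  # contraction (straight or curly apostrophe)
        if lemma in stoplist:
            continue  # number word, vocalization, ordinal or proper noun
        if lemma not in dictionary:
            continue  # not recognized by the spell checker
        if pos_tags.get(lemma) not in LEXICAL_POS:
            continue  # non-lexical (function) word
        kept.append((lemma, freq))
    return sorted(kept, key=lambda pair: (-pair[1], pair[0]))


# Toy data, invented purely for illustration.
example_counts = {
    "photosynthesis": 12, "the": 900, "don't": 40, "two": 30, "uh": 25,
    "first": 18, "james": 9, "sediment": 4, "teh": 6, "whereas": 7,
    "erode": 5,
}
example_result = curate(
    example_counts,
    ngsl={"the"},
    stoplist={"two", "uh", "first", "james"},
    dictionary={"photosynthesis", "sediment", "whereas", "erode", "the"},
    pos_tags={"photosynthesis": "noun", "sediment": "noun",
              "erode": "verb", "whereas": "conjunction"},
)
# example_result -> [('photosynthesis', 12), ('erode', 5)]
```

Ordering the checks from cheapest to most expensive is just a convenience here; since every rule is an exclusion, the order does not change which words survive.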

2 thoughts on “Introducing the TfSL (TOEFL Service List)”

  1. I’ve seen some articles on revisions to the General Service List on Academia, but I haven’t read any of them yet. I’ve also noticed an upcoming conference in Tokyo featuring some renowned vocabulary scholars such as Paul Nation and Charlie Browne, but I forgot to bookmark it, and for all I know, it may have already been held.
    Apps4efl.com just keeps on growing – you just keep innovating and adding new features. I’m recommending it to my classes but still not requiring it, which leads to a question: I’ve registered as a teacher, but now I can’t seem to use some of my old favorites, such as the Wikipedia-generated cloze exercises, without creating class lists first. Is there some way around that? Keep up the great work, Paul, and let me know how I can help spread word of your website besides just using it in my own classes. Oh wow – I’ve just noticed that you now have Facebook, Twitter and WordPress login!! I’ll check to see if that applies to apps4efl.com as well – NICE!!! Btw, I’ve included my WordPress page below, but I only use it to archive a few things.
