Importing and restructuring data from tatoeba.org

I recently decided to create a new English learning web-app, Sentence Builder, using the vast Creative Commons database of example sentences produced by tatoeba.org. In this post, I’ll explain how I got the two data files from tatoeba.org into a more usable form, and share the CSV file I was able to generate by restructuring the tatoeba.org data.

Tatoeba.org provides its entire example sentence database as a free (Creative Commons licensed) download.

Among the various files provided, the two essential ones are:

a) The list of all example sentences
b) The list of links between sentences

The first file (193MB decompressed) consists of 3,428,026 rows with three tab-separated columns. The first column contains a unique sentence ID, the second contains the language code for the sentence, and the third contains the sentence itself. A typical line looks something like this:

1    cmn    我們試試看!

Here we can see that the sentence with ID “1” is in Mandarin Chinese.
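If you want to poke around in this file yourself, loading it into a dictionary keyed on the sentence ID takes only a few lines of Python. A minimal sketch (assuming the decompressed file is named sentences.csv and, as with the Tatoeba exports, uses tabs as separators):

# sentence_id -> (language_code, sentence_text)
sentences = {}
with open("sentences.csv", encoding="utf-8") as f:
    for line in f:
        sent_id, lang, text = line.rstrip("\n").split("\t", 2)
        sentences[int(sent_id)] = (lang, text)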

The second file (109MB decompressed) consists of 7,271,303 rows with two tab-separated columns. The first column contains one sentence ID, and the second contains the ID of a matching sentence in another language. A typical line looks something like this:

1    1276

Here we can see that sentence ID 1 (the Mandarin sentence above) links to sentence 1276:

1276    eng    Let’s try something.

So now we know that “我們試試看!” in Mandarin corresponds to “Let’s try something” in English.
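Finding the translations of a given sentence is therefore just a matter of following the links. A rough sketch, reusing the sentences dictionary from the snippet above (the links file name is assumed):

from collections import defaultdict

# sentence_id -> set of IDs of the sentences it links to
links = defaultdict(set)
with open("links.csv", encoding="utf-8") as f:
    for line in f:
        src, dst = line.split()
        links[int(src)].add(int(dst))

# For example, print every available translation of sentence 1
for linked_id in links[1]:
    if linked_id in sentences:
        lang, text = sentences[linked_id]
        print(lang, text)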

Three important points about the links file: 1) the links are in no particular order; 2) one sentence can link to several other sentences; 3) reciprocal links are also provided (for some weird reason; a small sketch for dropping these duplicates follows the examples below). So, for sentence 1, the links are as follows:

1    77
1    1276
1    2481
1    5350
1    5972
1    180624
1    344899
1    345549
1    380381
…etc

but we also have:

77    1
1276    1
2481    1

…etc
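If the reciprocal duplicates are unwanted, one simple way to drop them is to order each pair so that the smaller ID always comes first, then de-duplicate. A sketch, using the same links file as above:

unique_pairs = set()
with open("links.csv", encoding="utf-8") as f:
    for line in f:
        a, b = (int(x) for x in line.split())
        # (1, 77) and (77, 1) collapse to the same pair
        unique_pairs.add((min(a, b), max(a, b)))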

Great. Not exactly a useful data format. What I really wanted was a single CSV with one column per language, where each row contains an example sentence ID followed by the translations in whichever languages are available. Regarding the language codes: there are QUITE A FEW. Here are all the language codes used, and the languages they represent:

  • afr, Afrikaans
  • amh, Amharic
  • ang, Old English
  • ara, Arabic
  • arq, Algerian Arabic
  • arz, Egyptian Arabic
  • ast, Asturian
  • avk, Kotava
  • aze, Azerbaijani
  • bak, Bashkir
  • bel, Belarusian
  • ben, Bengali
  • ber, Berber
  • bod, Tibetan
  • bos, Bosnian
  • bre, Breton
  • bul, Bulgarian
  • cat, Catalan
  • cbk, Chavacano
  • ces, Czech
  • cha, Chamorro
  • chr, Cherokee
  • chv, Chuvash
  • cmn, Mandarin
  • cor, Cornish
  • cycl, CycL
  • cym, Welsh
  • dan, Danish
  • deu, German
  • ell, Greek
  • eng, English
  • enm, Middle English
  • epo, Esperanto
  • est, Estonian
  • eus, Basque
  • fao, Faroese
  • fin, Finnish
  • fra, French
  • fry, Frisian
  • gla, Scottish Gaelic
  • gle, Irish
  • glg, Galician
  • grc, Ancient Greek
  • grn, Guarani
  • guj, Gujarati
  • hat, Haitian Creole
  • haw, Hawaiian
  • heb, Hebrew
  • hin, Hindi
  • hrv, Croatian
  • hsb, Upper Sorbian
  • hun, Hungarian
  • hye, Armenian
  • ido, Ido
  • ile, Interlingue
  • ina, Interlingua
  • ind, Indonesian
  • isl, Icelandic
  • ita, Italian
  • jbo, Lojban
  • jpn, Japanese
  • kal, Greenlandic
  • kat, Georgian
  • kaz, Kazakh
  • khm, Khmer
  • kin, Kinyarwanda
  • kor, Korean
  • lad, Ladino
  • lat, Latin
  • lit, Lithuanian
  • liv, Livonian
  • lkt, Lakota
  • lld, Ladin
  • ltz, Luxembourgish
  • lvs, Latvian
  • lzh, Literary Chinese
  • mal, Malayalam
  • mar, Marathi
  • mgm, Mambae
  • mhr, Meadow Mari
  • mkd, Macedonian
  • mlg, Malagasy
  • mlt, Maltese
  • mnw, Mon
  • mon, Mongolian
  • mri, Maori
  • nav, Navajo
  • nds, Low Saxon
  • nld, Dutch
  • nob, Norwegian (Bokmal)
  • non, Norwegian (Nynorsk)
  • nov, Novial
  • oci, Occitan
  • orv, Old East Slavic
  • oss, Ossetian
  • pes, Persian
  • pms, Piedmontese
  • pol, Polish
  • por, Portuguese
  • ppl, Pipil
  • que, Quechua
  • qya, Quenya
  • roh, Romansh
  • ron, Romanian
  • rus, Russian
  • sah, Yakut
  • san, Sanskrit
  • shs, Shuswap
  • sjn, Sindarin
  • slk, Slovak
  • slv, Slovenian
  • sna, Shona
  • spa, Spanish
  • sqi, Albanian
  • srp, Serbian
  • swe, Swedish
  • swh, Swahili
  • tam, Tamil
  • tat, Tatar
  • tgl, Tagalog
  • tha, Thai
  • tlh, Klingon
  • toki, Toki Pona
  • tpw, Old Tupi
  • tur, Turkish
  • uig, Uyghur
  • ukr, Ukrainian
  • urd, Urdu
  • uzb, Uzbek
  • vie, Vietnamese
  • vol, Volapuk
  • wuu, Shanghainese
  • xal, Kalmyk
  • xho, Xhosa
  • yid, Yiddish
  • yor, Yoruba
  • yue, Cantonese
  • zsm, Malay
  • zul, Zulu

As we can see, there are some very obscure and/or made-up languages (yes, I know technically all languages are ‘made up’, but you know what I mean), including the likes of Klingon, Middle English, Literary Chinese, etc. Furthermore, some of these languages only had a handful of example sentences. So, after purging the fictional, obscure, and under-represented languages (a filtering sketch follows the list), I got the list of languages down to the following 75:

  • Afrikaans
  • Albanian
  • Algerian Arabic
  • Amharic
  • Arabic
  • Armenian
  • Azerbaijani
  • Basque
  • Belarusian
  • Bengali
  • Berber
  • Bosnian
  • Bulgarian
  • Cantonese
  • Catalan
  • Cornish
  • Croatian
  • Czech
  • Danish
  • Dutch
  • Egyptian Arabic
  • Esperanto
  • Estonian
  • Finnish
  • French
  • Galician
  • Georgian
  • German
  • Greek
  • Hebrew
  • Hindi
  • Hungarian
  • Icelandic
  • Indonesian
  • Irish
  • Italian
  • Japanese
  • Kazakh
  • Khmer
  • Korean
  • Latin
  • Latvian
  • Lithuanian
  • Low Saxon
  • Malay
  • Malayalam
  • Mandarin
  • Marathi
  • Mongolian
  • Norwegian (Bokmal)
  • Norwegian (Nynorsk)
  • Old East Slavic
  • Persian
  • Polish
  • Portuguese
  • Quenya
  • Romanian
  • Russian
  • Scottish Gaelic
  • Serbian
  • Shanghainese
  • Slovak
  • Slovenian
  • Spanish
  • Swahili
  • Swedish
  • Tagalog
  • Thai
  • Turkish
  • Ukrainian
  • Urdu
  • Uzbek
  • Vietnamese
  • Welsh
  • Yiddish
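
Purging programmatically is straightforward once you have a whitelist of the codes to keep. A rough sketch (the whitelist below is deliberately abbreviated, and the file names are assumed):

# Abbreviated whitelist; in practice it would hold the codes of all 75 languages kept above
keep = {"eng", "deu", "fra", "spa", "jpn", "cmn", "rus", "por", "ita"}

with open("sentences.csv", encoding="utf-8") as src, \
     open("sentences_filtered.csv", "w", encoding="utf-8") as dst:
    for line in src:
        lang = line.split("\t", 2)[1]  # the language code is the second column
        if lang in keep:
            dst.write(line)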

Next, I set about merging the two files into a more usable form, which could then be used to populate a MySQL database. With over 7 million links between more than 3.4 million example sentences, this was not a task that could be done by hand. I’ve been slowly getting to grips with Python recently; it’s a very useful language for manipulating text files and CSVs. Below is the Python script I used to merge the links and example sentence files:

all_links_array = {}
eng_lines_array = {}
all_lines_array = {}
missing_lines_array = []

# Group the links: sentence ID -> set of linked sentence IDs
with open("all_links_dupe.txt") as f:
    for line in f:
        link_a, link_b = line.strip().split("#")
        all_links_array.setdefault(int(link_a), set()).add(int(link_b))

# English sentences: sentence ID -> "@English sentence"
with open("eng_lines_dupe.txt") as f:
    for line in f:
        id, content = line.strip().split("#eng", 1)
        eng_lines_array[int(id)] = content

# All sentences (every language): sentence ID -> "lang~sentence"
with open("all_lines_dupe.txt") as f:
    for line in f:
        id, content = line.strip().split("SPLITHERE", 1)
        all_lines_array[int(id)] = content

# Only process English sentences that actually have links
eng_lines_array_keys = set(eng_lines_array.keys())
all_links_array_keys = set(all_links_array.keys())
matching_array_keys = eng_lines_array_keys.intersection(all_links_array_keys)

errors = 0
matches = 0
for key in matching_array_keys:
    # Each English sentence starts a new row...
    print('\n"' + eng_lines_array[key] + '",', end="")
    # ...followed by every linked translation we can find
    for key2 in all_links_array[key]:
        if key2 in all_lines_array:
            print('"' + all_lines_array[key2] + '",', end="")
            matches += 1
        else:
            errors += 1
            missing_lines_array.append(key2)

print(missing_lines_array)
print("Missing keys: " + str(errors))
print("Matched keys: " + str(matches))

I should note that I had previously separated the English example sentences from the rest, as I was only interested in sentences that had an English translation available. This gave me three files: all_links_dupe.txt, all_lines_dupe.txt and eng_lines_dupe.txt. I also reformatted the files to make them easier to parse with my Python script (a rough sketch of this reformatting step follows the examples below). After the reformatting, the files looked like this:

all_lines_dupe:

1SPLITHEREcmn~我們試試看!
2SPLITHEREcmn~我该去睡觉了。
3SPLITHEREcmn~你在干什麼啊?
4SPLITHEREcmn~這是什麼啊?

…etc

all_links_dupe:

1#77
1#1276
1#2481
1#5350

…etc

eng_lines_dupe:

1000009#eng@Are they live?
1000113#eng@That is my book.
1000136#eng@I drink coffee.
1000480#eng@How can I ever forgive myself?

…etc
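A rough sketch of how that reformatting could be done from the original tab-separated exports (delimiters as shown above; file names assumed):

with open("sentences.csv", encoding="utf-8") as src, \
     open("all_lines_dupe.txt", "w", encoding="utf-8") as all_out, \
     open("eng_lines_dupe.txt", "w", encoding="utf-8") as eng_out:
    for line in src:
        sent_id, lang, text = line.rstrip("\n").split("\t", 2)
        # Every sentence, e.g. 1SPLITHEREcmn~我們試試看!
        all_out.write(sent_id + "SPLITHERE" + lang + "~" + text + "\n")
        # English sentences also go into their own file, e.g. 1000009#eng@Are they live?
        if lang == "eng":
            eng_out.write(sent_id + "#eng@" + text + "\n")

with open("links.csv", encoding="utf-8") as src, \
     open("all_links_dupe.txt", "w", encoding="utf-8") as out:
    for line in src:
        a, b = line.split()
        out.write(a + "#" + b + "\n")  # e.g. 1#77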

The results that the Python script generated looked like this:

"@How much come we know each other?","epo~Mi ĝojas revidi vin."
"@Alice (who must certainly be you)’s e-mail got read.","jbo~lo puza selmri be Alis noi ju'o do pu tcidu ke'a","mar~अॅलीसचं (जी म्हणजे नक्की तुम्हीच असाल) ईमेल वाचलं गेलं."
"@Oh geez, I’m not sure, but I think I ate too much.","jbo~.ii ju'o cu'i mi pu zi dukse citka"
"@Eimi, you assume it’s possible when you find the truth, which is impossible.","jbo~doi la Eimis do sruma lo du'u cumki lenu facki roda poi jetnu ju'onai"

I also needed a way to get the output into my database. I saved the data generated by the Python script as “matches.txt”, but this file was too big to import in one go: the PHP import script I wrote kept timing out whenever I fed it the entire file. So I split matches.txt into smaller chunks using Text File Splitter for Mac.
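The same split could equally be done with a few lines of Python; a rough sketch, with an arbitrary chunk size and the chunk naming that the PHP script below reads:

chunk_lines = 50000  # arbitrary; small enough that one chunk imports without timing out
chunk, n = [], 1
with open("matches.txt", encoding="utf-8") as src:
    for i, line in enumerate(src, 1):
        chunk.append(line)
        if i % chunk_lines == 0:
            # assumes the matches/ directory already exists
            with open("matches/matches_dupe-" + str(n) + ".txt", "w", encoding="utf-8") as out:
                out.writelines(chunk)
            chunk, n = [], n + 1
if chunk:  # write out whatever is left over
    with open("matches/matches_dupe-" + str(n) + ".txt", "w", encoding="utf-8") as out:
        out.writelines(chunk)

I then wrote a PHP script, run once per chunk, to populate a MySQL database with the data generated by the Python script: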

<?php
error_reporting(E_ALL);
echo "Working...";

// Connect to the database (credentials are placeholders)
$con = mysqli_connect("domain", "user", "password", "database");
if (mysqli_connect_errno()) {
    echo "Failed to connect to MySQL: " . mysqli_connect_error();
}
$con->set_charset("utf8");

// Read one chunk of the split matches file; each line is one English sentence plus its translations
$sentences = file('matches/matches_dupe-47.txt');
foreach ($sentences as $sentence) {
    // Fields are quoted and separated by ","
    $translations = explode('","', $sentence);
    foreach ($translations as $translation) {
        $translation = str_replace('"@', '@', $translation);
        $translation = str_replace('",', '', $translation);
        if ($translation[0] == "@") {
            // English sentence: insert a new row and remember its ID
            $insert_trans = substr($translation, 1);
            mysqli_query($con, "INSERT INTO sentences (eng) VALUES ('$insert_trans')");
            $id = mysqli_insert_id($con);
        } else {
            // Translation ("lang~sentence"): store it in the matching language column
            $tilde = strpos($translation, "~");
            $language = substr($translation, 0, $tilde);
            $remainder = trim(substr($translation, $tilde + 1));
            //$sql="SELECT $language FROM sentences";
            //$result=mysqli_query($con,$sql);
            //if ($result==false){
            //$sql="ALTER TABLE sentences ADD $language TEXT";
            //mysqli_query($con,$sql);
            //$sql="UPDATE sentences SET ".$language."='$remainder' WHERE sentence_id='$id'";
            //mysqli_query($con,$sql);
            //}
            //else{
            $sql = "UPDATE sentences SET " . $language . "='$remainder' WHERE sentence_id='$id'";
            mysqli_query($con, $sql);
            //}
        }
    }
}
echo "Complete!";
The commented section was only used on the first few iterations, to generate columns for each of the language codes found in the matches.txt files. This populated a MySQL table with the sentence ID in the first column, the English sentence in the second, and the translations in all available languages in the following columns.
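For reference, the sentences table the script assumes would have looked something like this before the language columns were added (a sketch; the exact column types aren’t shown anywhere in the script, so these are guesses):

CREATE TABLE sentences (
    sentence_id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    eng TEXT
) DEFAULT CHARSET=utf8;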

I was then able to export the MySQL table as a CSV with one column per language, each row containing the example sentence ID followed by its translations in the various languages provided. Finally, the data was in the format I wanted.
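One way to produce such a CSV straight from MySQL is SELECT ... INTO OUTFILE; something along these lines (the output path and options are illustrative, and the MySQL user needs the FILE privilege):

SELECT * FROM sentences
INTO OUTFILE '/tmp/sentence_builder.csv'
CHARACTER SET utf8
FIELDS TERMINATED BY ',' ENCLOSED BY '"'
LINES TERMINATED BY '\n';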