In a recent case study, Sennrich and Zhang (2019) developed a set of best practices for low-resource neural machine translation and showed that those practices can achieve better translation quality than phrase-based statistical machine translation on a 100,000-word dataset derived from the IWSLT 2014 German-English task.
Their best practices include using a smaller neural network with fewer layers, smaller batch sizes and a larger dropout parameter. Their largest improvements in translation quality (as measured by BLEU score) came from applying a byte-pair encoding that reduced the vocabulary from 14,000 words to 2,000 subword units.
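To make the byte-pair encoding step concrete, here is a minimal sketch of the merge-learning loop that tools like subword-nmt implement: start from a character-level vocabulary, repeatedly count adjacent symbol pairs and merge the most frequent one. The toy corpus and frequencies below are illustrative assumptions, not data from our experiment.

```python
import re
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs

def merge_pair(pair, vocab):
    """Merge every occurrence of the pair into a single symbol."""
    pattern = re.compile(r'(?<!\S)' + re.escape(' '.join(pair)) + r'(?!\S)')
    replacement = ''.join(pair)
    return {pattern.sub(replacement, word): freq for word, freq in vocab.items()}

# toy corpus: each word is a space-separated sequence of characters
vocab = {'p a r r u': 5, 'p a r r i': 4, 'p a r r a': 6, 'm a n c i a': 3}

merges = []
for _ in range(3):
    pairs = get_pair_counts(vocab)
    best = max(pairs, key=pairs.get)
    vocab = merge_pair(best, vocab)
    merges.append(best)

# after three merges the shared stem "parr" emerges as a single symbol,
# so the vocabulary shrinks even as coverage of word forms grows
```

In a real pipeline the number of merges is the tuning knob: fewer merges yield a smaller vocabulary of shorter subwords, which is exactly the direction that helped in the low-resource setting.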
Using those best practices on 40,000 words from the Sicilian-English dataset that we are developing achieves BLEU scores of 3.47 for English-to-Sicilian translation and 4.78 for Sicilian-to-English.
Those BLEU scores are low, but better than the BLEU score of 1.6 that Koehn and Knowles (2017) obtained with 377,000 words on English-to-Spanish neural machine translation.
This result encourages me to continue assembling Sicilian-English parallel text and gives me hope that neural machine translation will soon be available for all the world's languages.
For example, the English present tense has only two forms: "speak" and "speaks." By contrast, Sicilian has six different forms for the present tense. But splitting them into subword units:
yields something closer to English: "parr" matches "speak" and the Sicilian verb endings match the English pronouns.
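The splitting described above can be sketched as a function that separates a known stem from its ending, marking the boundary with the "@@ " continuation marker that subword-nmt uses. The six forms below assume the standard first-conjugation present-tense endings for "parrari" ("to speak"); they are given for illustration, and in practice the stem is discovered by the learned merges rather than supplied by hand.

```python
def split_stem(form, stem):
    """Split a verb form into a shared stem plus its ending,
    marking the subword boundary with '@@ ' as subword-nmt does."""
    if form.startswith(stem) and len(form) > len(stem):
        return stem + '@@ ' + form[len(stem):]
    return form

# illustrative present-tense forms of "parrari" ("to speak"),
# all sharing the stem "parr"
forms = ['parru', 'parri', 'parra', 'parramu', 'parrati', 'parranu']
splits = [split_stem(f, 'parr') for f in forms]
# e.g. 'parru' becomes 'parr@@ u'
```

After the split, "parr@@" carries the lexical meaning, matching English "speak," while the short endings play the role that pronouns play in English.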
Diminutives and augmentatives can be similarly expressed as sequences of subword units:
Subword splitting allows us to represent many different word forms with a much smaller vocabulary, which helps the translator learn rare and unknown words. So even if "jo manciu" ("I eat") does not appear at all in the dataset, subword splitting would still allow the translator to learn it from forms that do appear, like "jo parru" ("I speak") and "iddu mancia" ("he eats").
With a vocabulary of 1500 subwords, the sentence: "Carinisi are dogs!" gets tokenized and split into:
car@@ in@@ isi are dogs !
which is translated into Sicilian as:
cani car@@ in@@ isi !
and detokenized into: "Cani carinisi!"
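The final step above can be sketched as a small post-processing function: rejoin the "@@ " continuation markers, reattach punctuation, and capitalize the sentence. The punctuation and capitalization handling here is a simplified assumption; a real pipeline would use the Moses detokenizer.

```python
import re

def detokenize(text):
    """Rejoin subword units and undo tokenization (simplified sketch)."""
    text = re.sub(r'@@ ', '', text)             # 'car@@ in@@ isi' -> 'carinisi'
    text = re.sub(r' ([!?.,;:])', r'\1', text)  # reattach punctuation
    return text[:1].upper() + text[1:]          # capitalize the sentence

print(detokenize('cani car@@ in@@ isi !'))      # -> 'Cani carinisi!'
```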
Because subword splitting appears to be an effective tool for developing a neural machine translator for the Sicilian language, we will continue assembling parallel text and hope to present a better-quality translator soon. In the meantime, you can see the results of this experiment at Napizia.
Copyright © 2002-2019 Eryk Wdowiak