Natural Language

Subword Splitting

In a recent case study, Sennrich and Zhang (2019) develop a set of best practices for low-resource neural machine translation and show that those practices can achieve better translation quality than phrase-based statistical machine translation on a 100,000-word dataset derived from the 2014 German-English IWSLT task.

In their best practices, they suggest using a smaller neural network with fewer layers, smaller batch sizes and a larger dropout parameter. Their largest improvements in translation quality (as measured by BLEU score) came from applying a byte-pair encoding that reduced the vocabulary from 14,000 words to 2,000 words.

Applying those best practices to 40,000 words from the Sicilian-English dataset that we are developing yields BLEU scores of 3.47 for English-to-Sicilian and 4.78 for Sicilian-to-English.

Those BLEU scores are low, but better than the BLEU score of 1.6 that Koehn and Knowles (2017) obtained with 377,000 words on English-to-Spanish neural machine translation.

This result encourages me to continue assembling Sicilian-English parallel text and gives me hope that neural machine translation will soon be available for all the world's languages.

The byte-pair encoding algorithm developed by Sennrich, Haddow and Birch (2016) replaces the fixed vocabulary of the usual model with a vocabulary of "subwords."

For example, the English present tense only has two forms: "speak" and "speaks." By contrast, Sicilian has six different forms for the present tense. But splitting them into subwords:

Sicilian        English
parr + u        I      speak
parr + i        you    speak
parr + a        he     speak + s
parr + amu      we     speak
parr + ati      you    speak
parr + anu      they   speak

yields something closer to English: "parr" matches "speak" and the Sicilian verb endings match the English pronouns.
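To see how a split like this can emerge from data alone, here is a minimal Python sketch of the merge-learning idea behind byte-pair encoding: start from single characters and repeatedly merge the most frequent adjacent pair. The function names (`learn_bpe`, `merge_pair`), the `</w>` end-of-word marker and the word counts are assumptions made for illustration. This is not the code or the data used in the experiment.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` merges from a {word: count} dictionary (toy sketch)."""
    # Each word starts as a sequence of characters plus an end-of-word marker.
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for left, right in zip(symbols, symbols[1:]):
                pairs[(left, right)] += freq
        if not pairs:
            break
        # Pick the most frequent pair; ties go to the pair that was seen first.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        vocab = {merge_pair(symbols, best): freq for symbols, freq in vocab.items()}
    return merges, vocab


# Hypothetical counts: each present-tense form is assumed to occur once.
forms = {"parru": 1, "parri": 1, "parra": 1, "parramu": 1, "parrati": 1, "parranu": 1}
merges, vocab = learn_bpe(forms, num_merges=3)

print(merges)
for symbols in vocab:
    print(" ".join(symbols))
```

On this toy input, three merges are enough to assemble the stem "parr" as a single unit and leave the endings as separate symbols. Further merges follow frequency rather than grammar, so on real data the learned units will not always line up with stems and endings this neatly.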

Diminutives and augmentatives can be similarly expressed as sequences of subword units:

Sicilian              English
jatt + u              cat
jatt + ar + eddu      little cat
banc + u              bench
banc + ar + eddu      little bench

Subword splitting allows us to represent many different word forms in a much smaller vocabulary, thus allowing the translator to learn rare words and unknown words. So even if "jo manciu" ("I eat") does not appear at all in the dataset, subword splitting would allow the translator to learn it, provided that forms like "jo parru" ("I speak") and "iddu mancia" ("he eats") do appear.
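The point about unseen forms can be sketched in a few lines of Python. The example below assumes, hypothetically, that training on "parru" and "mancia" has left the units "parr", "u", "manci" and "a" in the subword vocabulary. It then splits the unseen word "manciu" by greedy longest match from the left. That is a simplification chosen for brevity: real byte-pair encoding replays its learned merges instead. The function name `segment` is also an assumption.

```python
def segment(word, subwords):
    """Split `word` into known subwords by greedy longest match from the left."""
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest candidate first and shrink until a known unit is found.
        for end in range(len(word), start, -1):
            if word[start:end] in subwords:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # No known unit starts here: fall back to a single character.
            pieces.append(word[start])
            start += 1
    return pieces


# Hypothetical subword inventory, assumed to come from "parru" and "mancia".
known_units = {"parr", "u", "manci", "a"}

print(segment("manciu", known_units))  # an unseen word built from seen units
print(segment("parra", known_units))
```

The unseen "manciu" comes out as "manci" followed by "u", two units the translator has already met in other words, which is what lets it handle a form that never occurred in the training text.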

With a vocabulary of 1500 subwords, the sentence: "Carinisi are dogs!" gets tokenized and split into:

car@@ in@@ isi are dogs !

which is translated into Sicilian as:

cani car@@ in@@ isi !

and detokenized into: "Cani carinisi!"
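The "@@" markers in the output above flag a subword that continues into the next token, so rejoining the pieces is a simple string operation. The sketch below shows one way to add and remove those markers in Python. The function names `mark_subwords` and `join_subwords` are assumptions, and the final step of restoring capitalization and attaching punctuation ("cani carinisi !" to "Cani carinisi!") is a separate detokenization step that is not shown.

```python
import re


def mark_subwords(pieces):
    """Join the pieces of one word, marking every non-final piece with '@@'."""
    return "@@ ".join(pieces)


def join_subwords(line):
    """Remove '@@' continuation markers so that split words are whole again."""
    return re.sub(r"@@( |$)", "", line)


print(mark_subwords(["car", "in", "isi"]))
print(join_subwords("cani car@@ in@@ isi !"))
```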

Because subword splitting appears to be an effective tool for developing a neural machine translator for the Sicilian language, we will continue assembling parallel text and hope to present a better-quality translator soon. In the meantime, you can see the results of this experiment at Napizia.

Copyright © 2002-2019 Eryk Wdowiak