table of contents
Sicilian language
Recent research and our own experiments have shown that it is possible to create neural machine translators that achieve relatively high BLEU scores with small datasets of parallel text.
The trick is to train a smaller model for the smaller dataset.
Training a large model on a small dataset is comparable to estimating a regression model with a large number of parameters on a dataset with few observations: It leaves you with too few degrees of freedom. The model thus becomes over-fit and does not make good predictions.
Reducing the vocabulary with subword-splitting training a smaller network and setting a high-dropout parameter reduce over-fitting. And self-attentional neural networks also reduce over-fitting because (compared to recurrent and convolutional networks) they are less complex. They directly model the relationships between words in a pair of sentences.
This combination of splitting, dropout and self-attention achieved a BLEU score of 25.1 on English-to-Sicilian translation and 29.1 on Sicilian-to-English with only 16,945 lines of parallel training data containing 266,514 Sicilian words and 269,153 English words.
And because the networks were small, each model took just under six hours to train on CPU.
Our success is an implementation of the best practices developed by Sennrich and Zhang (2019) with the self-attentional Transformer model developed by Vaswani et al. (2017).
For training, we used the Sockeye toolkit by Hieber et al. (2017) running on a server with four 2.40 GHz virtual CPUs.
In their best practices for low-resource NMT, Sennrich and Zhang suggest the byte-pair encoding (i.e. subword-splitting) developed by Sennrich, Haddow and Birch (2016), a smaller neural network with fewer layers, smaller batch sizes and larger dropout parameters.
Using those best practices in the "BiDeep RNN" architecture proposed by Miceli Barone et al. (2017), they achieved a BLEU score of 16.6 on German-to-English translation with only 100,000 words of parallel training data.
Their largest improvements in translation quality came from the application of a byte-pair encoding (i.e. subword-splitting) that reduced the vocabulary from 14,000 words to 2000 words. But their most successful training also occurred when they set high dropout parameters.
During training, dropout randomly shuts off a percentage of units (by setting it to zero), which effectively prevents the units from adapting to each other. Each unit therefore becomes more independent of the others because the model is trained as if it had a smaller number of units, thus reducing over-fitting (Srivastava et al. (2014)).
dataset | subwords | En-Sc | Sc-En | ||||||
---|---|---|---|---|---|---|---|---|---|
20 | 2,000 | 11.4 | 12.9 | ||||||
21 | 2,000 | 12.9 | 13.3 | ||||||
23 | 3,000 | 19.6 | 19.5 | ||||||
24 | 3,000 | 19.6 | 21.5 | ||||||
25 | 3,000 | 21.1 | 21.2 | ||||||
27 | 3,000 | 22.4 | 24.1 | ||||||
28 | 3,000 | 22.5 | 25.2 | ||||||
29 | 3,000 | 24.6 | 27.0 | ||||||
30 | 3,000 | 25.1 | 29.1 | ||||||
30 +back |
5,000 | 27.7 | – | ||||||
30 Books +back |
Sc: 5,000 En: 7,500 It: 5,000 |
19.7 35.1* |
26.2 34.6* |
||||||
33 homework Books +back |
Sc: 5,000 En: 7,500 It: 5,000 |
|
|
||||||
34 |
Sc: 5,000 En: 7,500 It: 5,000 |
|
|
||||||
35 |
Sc: 5,000 En: 7,500 It: 5,000 |
|
|
||||||
* larger model, † M2M model |
dataset | lines | Sc words | En words |
---|---|---|---|
20 | 7,721 | 121,136 | 121,892 |
21 | 8,660 | 146,370 | 146,437 |
23 | 12,095 | 171,278 | 175,174 |
24 | 13,060 | 178,714 | 183,736 |
25 | 13,392 | 185,540 | 190,538 |
27 | 13,839 | 190,072 | 195,372 |
28 | 14,494 | 196,911 | 202,652 |
29 | 16,591 | 258,730 | 261,474 |
30 | 16,945 | 266,514 | 269,153 |
30 +back |
16,829 +3,251 |
261,421 +92,141 |
264,242 – |
30 Books +back |
16,891 32,804 +3,250 |
262,582 – +92,146 |
266,740 929,043 – |
33 hw Sc-En hw Sc-It hw En-It Books +bk En/It→Sc |
12,357 4,660 4,660 4,660 28,982 +3,250 |
237,456 30,244 30,244 – – +92,146 |
236,568 35,173 – 35,173 836,757 – |
34 Sc-It En-It hw Sc-En hw Sc-It hw En-It +bk It→Sc |
14,102 4,735 369,260 5,914 5,914 5,914 +6,847 |
271,712 38,362 – 60,709 60,709 – +152,861 |
270,370 – 9,481,900 66,261 – 66,261 – |
35 NLLB WikiMatrix Sc-It En-It hw Sc-En hw Sc-It hw En-It +bk It→Sc |
14,118 50,000 3,581 4,735 369,260 5,914 5,914 5,914 +6,847 |
271,935 473,911 92,948 38,362 – 60,709 60,709 – +152,861 |
270,601 472,691 – – 9,481,900 66,261 – 66,261 – |
defaults | ours | larger | M2M | |
---|---|---|---|---|
layers | 6 | 3 | 4 | 4 |
embedding size | 512 | 256 | 384 | 512 |
model size | 512 | 256 | 384 | 512 |
attention heads | 8 | 4 | 6 | 8 |
feed forward | 2048 | 1024 | 1536 | 2048 |
Subword-splitting and high dropout parameters helped us achieve better than expected results with a small dataset, but it was the Transformer model that pushed our BLEU scores into the double digits.
Compared to recurrent neural networks, the self-attention layers in the Transformer model more easily learn the dependencies between words in a sequence because the self-attention layers are less complex.
Recurrent networks read words sequentially and employ a gating mechanism to identify relationships between separated words in a sequence. By contrast, self-attention examines the links between all the words in the paired sequences and directly models those relationships. It's a simpler approach.
Combining these three features – subword-splitting, dropout and self-attention – yields a trained model that makes relatively good predictions. And as described on the multilingual translation page, adding Italian-English data should improve translation quality even more.
In an initial experiment, we added the Italian-English subset of Farkas' Books to our dataset and and trained two translators – one from Sicilian and Italian into English and the other from English into Sicilian and Italian.
As shown in the table above, holding model size constant reduced translation quality, an effect that is consistent with the findings of Arivazhagan et al. (2019), who show that training a larger model can improve translation quality across the board.
So to push our BLEU scores into the thirties, we trained a larger model. And, as we'll discuss on the multilingual translation page, we also trained another model that can translate between English, Sicilian and Italian.
So come to Napizia and explore all six translation directions with our Tradutturi Sicilianu!
Copyright © 2002-2024 Eryk Wdowiak