Just Split, Dropout and Pay Attention

a recipe for low-resource NMT


Recent research and our own experiments have shown that it is possible to create neural machine translators that achieve relatively high BLEU scores with small datasets of parallel text.

The trick is to train a smaller model for the smaller dataset.

Training a large model on a small dataset is comparable to estimating a regression model with a large number of parameters on a dataset with few observations: it leaves you with too few degrees of freedom. The model therefore over-fits and does not make good predictions.
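The regression analogy can be made concrete with a toy example. The sketch below is only an illustration, not part of our experiments: the six data points are hypothetical (a line y = 2x + 1 plus a small alternating disturbance), and the helper names `fit_line` and `fit_interpolant` are ours. It compares a two-parameter fit with a six-parameter polynomial that uses up every degree of freedom.

```python
def fit_line(xs, ys):
    """Ordinary least squares with two parameters (intercept and slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def fit_interpolant(xs, ys):
    """Polynomial with as many parameters as observations (Lagrange form)."""
    def predict(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            term = yi
            for j, xj in enumerate(xs):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return predict


def max_abs_error(model, points):
    """Largest absolute prediction error over a list of (x, y) pairs."""
    return max(abs(model(x) - y) for x, y in points)


# Hypothetical data: y = 2x + 1 plus a small alternating disturbance.
xs = [0, 1, 2, 3, 4, 5]
ys = [2 * x + 1 + (0.5 if x % 2 == 0 else -0.5) for x in xs]
train = list(zip(xs, ys))
held_out = [(6, 13.0)]  # the undisturbed line at x = 6

small_model = fit_line(xs, ys)         # 2 parameters, 6 observations
large_model = fit_interpolant(xs, ys)  # 6 parameters, 6 observations

for name, model in [("small", small_model), ("large", large_model)]:
    print(name,
          "train error:", round(max_abs_error(model, train), 3),
          "held-out error:", round(max_abs_error(model, held_out), 3))
```

The large model reproduces every training point exactly, yet at x = 6 it predicts about -18.5 where the underlying line gives 13. The small model misses the training points slightly and predicts about 12.7. That is the pattern we want to avoid when training a translator on a small dataset.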

Reducing the vocabulary with subword-splitting, training a smaller network and setting a high dropout parameter all reduce over-fitting. Self-attentional neural networks also reduce over-fitting because (compared to recurrent and convolutional networks) they are less complex: they directly model the relationships between words in a pair of sentences.

This combination of splitting, dropout and self-attention achieved a BLEU score of 25.1 on English-to-Sicilian translation and 29.1 on Sicilian-to-English with only 16,945 lines of parallel training data containing 266,514 Sicilian words and 269,153 English words.

And because the networks were small, each model took just under six hours to train on CPU.

Our approach implements the best practices developed by Sennrich and Zhang (2019) with the self-attentional Transformer model developed by Vaswani et al. (2017).

For training, we used the Sockeye toolkit by Hieber et al. (2017) running on a server with four 2.40 GHz virtual CPUs.

In their best practices for low-resource NMT, Sennrich and Zhang suggest the byte-pair encoding (i.e. subword-splitting) developed by Sennrich, Haddow and Birch (2016), a smaller neural network with fewer layers, smaller batch sizes and larger dropout parameters.
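To give a feel for how byte-pair encoding shrinks a vocabulary, here is a minimal sketch of the merge-learning step in the spirit of Sennrich, Haddow and Birch (2016). It is a toy, not the implementation used for our experiments: the function name `learn_bpe`, the `</w>` end-of-word marker, the tiny word list and the first-seen tie-breaking rule are all assumptions made for illustration.

```python
from collections import Counter


def learn_bpe(word_counts, num_merges):
    """Learn up to `num_merges` merge operations from a word-frequency dict."""
    # Start from characters, with an end-of-word marker on every word.
    vocab = {tuple(word) + ("</w>",): count
             for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs.
        pairs = Counter()
        for symbols, count in vocab.items():
            for left, right in zip(symbols, symbols[1:]):
                pairs[(left, right)] += count
        if not pairs:
            break
        # Most frequent pair; ties go to the pair that was seen first.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word with the chosen pair fused into one symbol.
        new_vocab = {}
        for symbols, count in vocab.items():
            merged = []
            i = 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_vocab[key] = new_vocab.get(key, 0) + count
        vocab = new_vocab
    return merges, vocab


# Hypothetical toy corpus (word -> frequency).
toy_counts = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
toy_merges, toy_vocab = learn_bpe(toy_counts, num_merges=3)
print(toy_merges)
print(toy_vocab)
```

On this toy corpus the first three merges build the shared suffix "est", so "newest" and "widest" end up sharing one subword unit. The number of merges controls the size of the subword vocabulary, which is the knob that the best practices suggest turning down for small datasets.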

Using those best practices in the "BiDeep RNN" architecture proposed by Miceli Barone et al. (2017), they achieved a BLEU score of 16.6 on German-to-English translation with only 100,000 words of parallel training data.

Their largest improvements in translation quality came from the application of a byte-pair encoding (i.e. subword-splitting) that reduced the vocabulary from 14,000 words to 2,000 words. But their most successful training also occurred when they set high dropout parameters.

During training, dropout randomly shuts off a percentage of units (by setting them to zero), which effectively prevents the units from adapting to each other. Each unit therefore becomes more independent of the others because the model is trained as if it had a smaller number of units, thus reducing over-fitting (Srivastava et al., 2014).
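The mechanism described above can be sketched in a few lines. This is an illustration under stated assumptions, not the toolkit's code: the function name `dropout` is ours, and we assume the common "inverted dropout" convention, in which surviving units are rescaled during training so that nothing needs to change at prediction time.

```python
import random


def dropout(activations, rate, rng, training=True):
    """Zero each unit with probability `rate`; rescale the survivors.

    Assumes the "inverted dropout" convention: kept units are divided by
    the keep probability so the expected activation is unchanged.
    """
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    if not training or rate == 0.0:
        return list(activations)  # no units are dropped at prediction time
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(0)           # fixed seed so the sketch is repeatable
layer_output = [1.0] * 1000      # hypothetical activations of one layer
dropped = dropout(layer_output, rate=0.5, rng=rng)
print("units shut off:", sum(1 for a in dropped if a == 0.0), "of", len(dropped))
```

With a rate of 0.5, roughly half of the units are silenced on each training step, so each step trains a different, smaller sub-network.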

BLEU scores

dataset            subwords     En-Sc          Sc-En
20                 2,000        11.4           12.9
21                 2,000        12.9           13.3
23                 3,000        19.6           19.5
24                 3,000        19.6           21.5
25                 3,000        21.1           21.2
27                 3,000        22.4           24.1
28                 3,000        22.5           25.2
29                 3,000        24.6           27.0
30                 3,000        25.1           29.1
30 +back           5,000        27.7
30 Books +back     Sc: 5,000    19.7           26.2
                   En: 7,500    35.1*          34.6*
                   It: 5,000
33 homework        Sc: 5,000    35.0*          36.8*
   Books +back     En: 7,500    It-Sc 36.5†    Sc-It 30.9†
                   It: 5,000

* larger model,  † M2M model

datasets

dataset              lines    Sc words    En words
20                   7,721     121,136     121,892
21                   8,660     146,370     146,437
23                  12,095     171,278     175,174
24                  13,060     178,714     183,736
25                  13,392     185,540     190,538
27                  13,839     190,072     195,372
28                  14,494     196,911     202,652
29                  16,591     258,730     261,474
30                  16,945     266,514     269,153
30                  16,829     261,421     264,242
  +back             +3,251     +92,141
30                  16,891     262,582     266,740
  Books             32,804                 929,043
  +back             +3,250     +92,146
33                  12,357     237,456     236,568
  hw Sc-En           4,660      30,244      35,173
  hw Sc-It           4,660      30,244
  hw En-It           4,660                  35,173
  Books             28,982                 836,757
  +bk Sc→It         +3,250
  +bk En/It→Sc      +3,250     +92,146

model sizes

                   defaults   ours   larger    M2M
layers                    6      3        4      4
embedding size          512    256      384    512
model size              512    256      384    512
attention heads           8      4        6      8
feed forward           2048   1024     1536   2048
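To see how much smaller these configurations are, the sketch below makes a back-of-the-envelope count of the main weight matrices. It rests on several assumptions that are not stated in the table: the same number of layers in the encoder and the decoder, four square projection matrices per attention block, two matrices per feed-forward block, and no biases, layer normalization or embeddings. The function name `transformer_weight_count` is ours, and the totals are rough estimates, not the parameter counts reported by the toolkit.

```python
def transformer_weight_count(layers, model_size, feed_forward):
    """Rough count of attention and feed-forward weights (see assumptions)."""
    attention = 4 * model_size * model_size    # query, key, value, output
    ffn = 2 * model_size * feed_forward        # two feed-forward matrices
    encoder_layer = attention + ffn            # self-attention + feed forward
    decoder_layer = 2 * attention + ffn        # self- and cross-attention + feed forward
    return layers * (encoder_layer + decoder_layer)


# Values copied from the model sizes table above.
configs = {
    "defaults": dict(layers=6, model_size=512, feed_forward=2048),
    "ours":     dict(layers=3, model_size=256, feed_forward=1024),
    "larger":   dict(layers=4, model_size=384, feed_forward=1536),
    "M2M":      dict(layers=4, model_size=512, feed_forward=2048),
}

counts = {name: transformer_weight_count(**cfg) for name, cfg in configs.items()}
for name, count in counts.items():
    print(f"{name:>8}: {count:>12,} weights")
```

Under these assumptions, halving the model and feed-forward sizes cuts each layer to a quarter of its weights, and halving the number of layers halves the total again, so our model carries about one eighth of the weights of the default configuration.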

Subword-splitting and high dropout parameters helped us achieve better than expected results with a small dataset, but it was the Transformer model that pushed our BLEU scores into the double digits.

Compared to recurrent neural networks, the self-attention layers in the Transformer model more easily learn the dependencies between words in a sequence because they are less complex.

Recurrent networks read words sequentially and employ a gating mechanism to identify relationships between separated words in a sequence. By contrast, self-attention examines the links between all the words in the paired sequences and directly models those relationships. It's a simpler approach.
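The direct modelling of relationships can be illustrated with scaled dot-product attention, the core operation in Vaswani et al. (2017). The sketch below is heavily simplified and is not the toolkit's code: it omits the learned projection matrices, the multiple heads and any masking, and the function names and the two-dimensional word vectors are hypothetical.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to one."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    dim = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # Score this word against every word in the sequence at once.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
                  for k in keys]
        w = softmax(scores)
        weights.append(w)
        # Each output is a weighted average of the value vectors.
        outputs.append([sum(wj * v[c] for wj, v in zip(w, values))
                        for c in range(len(values[0]))])
    return outputs, weights


# Hypothetical two-dimensional vectors for a three-word sequence.
words = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended, attention_weights = self_attention(words, words, words)
for row in attention_weights:
    print([round(w, 3) for w in row])
```

Every word is compared with every other word in a single step, with no recurrence and no gates. The row of weights for a word shows how much each other word contributes to its new representation.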

Combining these three features – subword-splitting, dropout and self-attention – yields a trained model that makes relatively good predictions. And as described on the multilingual translation page, adding Italian-English data should improve translation quality even more.

In an initial experiment, we added the Italian-English subset of Farkas' Books to our dataset and trained two translators – one from Sicilian and Italian into English and the other from English into Sicilian and Italian.

As shown in the table above, holding model size constant reduced translation quality, an effect that is consistent with the findings of Arivazhagan et al. (2019), who show that training a larger model can improve translation quality across the board.

So to push our BLEU scores into the thirties, we trained a larger model. And, as we'll discuss on the multilingual translation page, we also trained another model that can translate between English, Sicilian and Italian.

So come to Napizia and explore all six translation directions with our Tradutturi Sicilianu!

Copyright © 2002-2022 Eryk Wdowiak