Just Split, Dropout and Pay Attention

a recipe for low-resource NMT

Project Napizia

Recent research and our own experiments have shown that it is possible to create neural machine translators that achieve relatively high BLEU scores with small datasets of parallel text.

The trick is to train a smaller model for the smaller dataset.

Training a large model on a small dataset is comparable to estimating a regression model with a large number of parameters on a dataset with few observations: It leaves you with too few degrees of freedom. The model thus becomes over-fit and does not make good predictions.
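The regression analogy can be made concrete with a minimal sketch. The six data points, the alternating noise and the function names below are illustrative assumptions, not data from our experiments. A polynomial with as many parameters as observations fits the training points exactly but predicts badly at a new point, while a two-parameter line keeps some degrees of freedom and predicts well.

```python
def lagrange_predict(xs, ys, x):
    """Evaluate the unique polynomial of degree len(xs)-1 through all points."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical toy data: the true relationship is y = x, plus alternating noise.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
noise = [0.5, -0.5, 0.5, -0.5, 0.5, -0.5]
ys = [x + e for x, e in zip(xs, noise)]

slope, intercept = fit_line(xs, ys)
x_new = 6.0  # an unseen point; the true value is 6.0

# Six parameters for six observations: zero degrees of freedom left.
big_train_error = max(abs(lagrange_predict(xs, ys, x) - y) for x, y in zip(xs, ys))
big_error = abs(lagrange_predict(xs, ys, x_new) - x_new)

# Two parameters for six observations: four degrees of freedom left.
small_error = abs(slope * x_new + intercept - x_new)

print("large model, training error:", big_train_error)
print("large model, error at new point:", big_error)
print("small model, error at new point:", small_error)
```

In this toy setting the large model reproduces the training data perfectly, yet it misses the unseen point by about 31.5, while the small model misses it by about 0.3.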

Reducing the vocabulary with subword-splitting, training a smaller network and setting a high dropout parameter all reduce over-fitting. Self-attentional neural networks also reduce over-fitting because (compared to recurrent and convolutional networks) they are less complex: they directly model the relationships between words in a pair of sentences.
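To illustrate what "directly modeling the relationships between words" means, here is a minimal sketch of scaled dot-product self-attention in plain Python. It is a simplification under stated assumptions: it omits the learned query, key and value projections and the multiple heads of the Transformer described by Vaswani et al. (2017), and the two-dimensional word vectors are hypothetical.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that sum to one."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Scaled dot-product self-attention with identity projections.

    Every word vector is compared with every other word vector in one step,
    so no recurrence or convolution is needed to relate distant words.
    """
    dim = len(vectors[0])
    outputs, weights = [], []
    for query in vectors:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        row = softmax(scores)
        weights.append(row)
        # Each output is a weighted average of all word vectors.
        outputs.append(
            [sum(w * v[c] for w, v in zip(row, vectors)) for c in range(dim)]
        )
    return outputs, weights


# Hypothetical embeddings for a three-word sentence; words 0 and 2 are alike.
sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
outputs, weights = self_attention(sentence)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` shows how strongly one word attends to every word in the sentence. In this toy input, the first word gives equal weight to itself and to the third word, and less weight to the second.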

This combination of splitting, dropout and self-attention achieved a BLEU score of 25.1 on English-to-Sicilian translation and 29.1 on Sicilian-to-English with only 16,945 lines of parallel training data containing 266,514 Sicilian words and 269,153 English words.

And because the networks were small, each model took just under six hours to train on CPU.

Our approach applies the best practices developed by Sennrich and Zhang (2019) to the self-attentional Transformer model developed by Vaswani et al. (2017).

For training, we used the Sockeye toolkit by Hieber et al. (2017) running on a server with four 2.40 GHz virtual CPUs.

In their best practices for low-resource NMT, Sennrich and Zhang suggest the byte-pair encoding (i.e. subword-splitting) developed by Sennrich, Haddow and Birch (2016), a smaller neural network with fewer layers, smaller batch sizes and larger dropout parameters.

Using those best practices in the "BiDeep RNN" architecture proposed by Miceli Barone et al. (2017), they achieved a BLEU score of 16.6 on German-to-English translation with only 100,000 words of parallel training data.

Their largest improvements in translation quality came from applying byte-pair encoding, which reduced the vocabulary from 14,000 words to 2,000 words. Their most successful training runs also used high dropout parameters.
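The way byte-pair encoding shrinks a vocabulary can be sketched in a few lines of Python. This is a minimal illustration of the merge-learning idea from Sennrich, Haddow and Birch (2016), not the implementation used in our experiments. The function name `learn_bpe` and the toy word counts are assumptions made for the example.

```python
from collections import Counter


def learn_bpe(word_counts, num_merges):
    """Learn BPE merge operations from a {word: frequency} dictionary."""
    # Start with each word split into characters plus an end-of-word marker.
    vocab = {tuple(word) + ("</w>",): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent pair of symbols occurs.
        pairs = Counter()
        for symbols, count in vocab.items():
            for left, right in zip(symbols, symbols[1:]):
                pairs[(left, right)] += count
        if not pairs:
            break
        # Most frequent pair; ties go to the pair that was seen first.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Replace every occurrence of the best pair with one merged symbol.
        new_vocab = {}
        for symbols, count in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_vocab[key] = new_vocab.get(key, 0) + count
        vocab = new_vocab
    return merges, vocab


# Hypothetical toy corpus: word -> frequency.
toy_counts = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, segmented = learn_bpe(toy_counts, num_merges=3)
print(merges)
for symbols, count in segmented.items():
    print(" ".join(symbols), count)
```

The number of merge operations controls the vocabulary size: fewer merges leave words split into smaller, more frequent pieces, so rare words share subword units with common ones.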

During training, dropout randomly shuts off a percentage of units (by setting them to zero), which effectively prevents the units from adapting to each other. Each unit therefore becomes more independent of the others because the model is trained as if it had a smaller number of units, thus reducing over-fitting (Srivastava et al., 2014).
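The mechanism is simple enough to sketch in plain Python. The example below assumes the common "inverted dropout" variant, which rescales the surviving units during training so that nothing needs to change at inference time. That scaling choice, the function name and the toy activations are assumptions for illustration and may differ from what a particular toolkit does internally.

```python
import random


def dropout(activations, rate, rng, training=True):
    """Zero out each unit with probability `rate` during training."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    if not training or rate == 0.0:
        # At inference time every unit stays active.
        return list(activations)
    keep = 1.0 - rate
    # Surviving units are scaled by 1/keep so the expected sum is unchanged.
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
hidden = [1.0] * 10     # hypothetical activations of one layer

print(dropout(hidden, rate=0.5, rng=rng))                  # training: some units are zeroed
print(dropout(hidden, rate=0.5, rng=rng, training=False))  # inference: unchanged
```

A higher `rate` shuts off more units on each training step, so each step trains what is effectively a smaller network.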

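BLEU scores

| dataset | subwords | En-Sc | Sc-En |
|---|---|---|---|
| | 2,000 | 11.4 | 12.9 |
| | 2,000 | 12.9 | 13.3 |
| | 3,000 | 19.6 | 19.5 |
| | 3,000 | 19.6 | 21.5 |
| | 3,000 | 21.1 | 21.2 |
| | 3,000 | 22.4 | 24.1 |
| | 3,000 | 22.5 | 25.2 |
| | 3,000 | 24.6 | 27.0 |
| | 3,000 | 25.1 | 29.1 |
| 30 +back | 5,000 | 27.7 | – |
| 30 Books +back | | | |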