Natural Language

Just Split, Dropout and Pay Attention

Recent research and our own experiments have shown that it is possible to create neural machine translators that achieve relatively high BLEU scores with small datasets of parallel text.

The trick is to train a smaller model for the smaller dataset.

Training a large model on a small dataset is comparable to estimating a regression model with a large number of parameters on a dataset with few observations:  It leaves you with too few degrees of freedom. The model thus becomes over-fit and does not make good predictions.
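The regression analogy can be made concrete with a small numerical illustration of our own (it is not part of the translation experiments): fitting eight noisy observations with an eight-parameter polynomial leaves zero degrees of freedom, so the model memorizes the noise and predicts poorly on held-out points, while a simple line generalizes.

```python
import numpy as np

rng = np.random.default_rng(0)

# a small "dataset": 8 noisy observations of a linear relationship
x = np.linspace(0.0, 1.0, 8)
y = 2.0 * x + rng.normal(scale=0.1, size=x.size)

# held-out points drawn from the same underlying relationship
x_test = np.linspace(0.05, 0.95, 8)
y_test = 2.0 * x_test

# a degree-7 polynomial has one parameter per observation, so it
# interpolates the noise exactly; the straight line generalizes better
errs = {}
for degree in (1, 7):
    coefs = np.polyfit(x, y, degree)
    train_mse = np.mean((np.polyval(coefs, x) - y) ** 2)
    test_mse = np.mean((np.polyval(coefs, x_test) - y_test) ** 2)
    errs[degree] = (train_mse, test_mse)
    print(f"degree {degree}: train MSE {train_mse:.4f}, test MSE {test_mse:.4f}")
```

The degree-7 fit drives the training error to essentially zero, which is exactly the over-fitting that a large translation model exhibits on a small parallel corpus.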

Reducing the vocabulary with subword-splitting, training a smaller network and setting a high dropout parameter all reduce over-fitting. Self-attentional neural networks reduce over-fitting too, because (compared to recurrent and convolutional networks) they are less complex: they directly model the relationships between words in a pair of sentences.

This combination of splitting, dropout and self-attention achieved a BLEU score of 11.4 on English-to-Sicilian translation and 12.9 on Sicilian-to-English with only 7721 lines of parallel training data containing 121,892 English words and 121,136 Sicilian words.

And because the networks were small, each model took less than four hours to train on CPU.

Our success comes from implementing the best practices developed by Sennrich and Zhang (2019) with the self-attentional Transformer model developed by Vaswani et al. (2017).

For training, we used the Sockeye toolkit by Hieber et al. (2017) running on a server with four 2.40 GHz virtual CPUs.

In their best practices for low-resource NMT, Sennrich and Zhang suggest the byte-pair encoding (i.e. subword-splitting) developed by Sennrich, Haddow and Birch (2016), a smaller neural network with fewer layers, smaller batch sizes and larger dropout parameters.

Using those best practices in the "BiDeep RNN" architecture proposed by Miceli Barone et al. (2017), they achieved a BLEU score of 16.6 on German-to-English translation with only 100,000 words of parallel training data.

Their largest improvements in translation quality came from the application of a byte-pair encoding (i.e. subword-splitting) that reduced the vocabulary from 14,000 words to 2000 words. But their most successful training also occurred when they set high dropout parameters.
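The merge-learning step of byte-pair encoding is compact enough to sketch here. The code below is adapted from the reference implementation published in Sennrich, Haddow and Birch (2016); the toy vocabulary is our own, not drawn from the translation data.

```python
import re
import collections

def get_stats(vocab):
    """Count the frequency of each adjacent symbol pair in the vocabulary."""
    pairs = collections.Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[symbols[i], symbols[i + 1]] += freq
    return pairs

def merge_vocab(pair, vocab):
    """Merge every occurrence of the given symbol pair into one symbol."""
    bigram = re.escape(' '.join(pair))
    pattern = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    return {pattern.sub(''.join(pair), word): freq
            for word, freq in vocab.items()}

# toy corpus: word frequencies, with each word pre-split into
# characters plus an end-of-word marker (</w>)
vocab = {'l o w </w>': 5, 'l o w e r </w>': 2,
         'n e w e s t </w>': 6, 'w i d e s t </w>': 3}

merges = []
for _ in range(10):          # learn 10 merge operations
    pairs = get_stats(vocab)
    best = max(pairs, key=pairs.get)
    vocab = merge_vocab(best, vocab)
    merges.append(best)
    print(best)
```

The most frequent pairs merge first, so common words are reassembled into whole units while rare words stay split into smaller pieces, which is what shrinks the vocabulary so dramatically.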

During training, dropout randomly shuts off a percentage of units (by setting them to zero), which effectively prevents the units from adapting to each other. Each unit therefore becomes more independent of the others because the model is trained as if it had a smaller number of units, thus reducing over-fitting (Srivastava et al., 2014).
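A minimal sketch of the technique (in its common "inverted dropout" form, where survivors are rescaled at training time so the expected activation is unchanged) might look like this; the layer size and dropout rate are illustrative, not the values we trained with:

```python
import numpy as np

def dropout(activations, rate, rng):
    """Inverted dropout: zero out a fraction `rate` of units at random
    and rescale the survivors so the expected activation is unchanged."""
    keep = rng.random(activations.shape) >= rate
    return activations * keep / (1.0 - rate)

rng = np.random.default_rng(42)
h = np.ones(10)              # a toy layer of 10 unit activations
print(dropout(h, 0.3, rng))  # about 30% of entries zeroed, rest scaled up
```

Because a different random subset of units is zeroed on every training batch, no unit can rely on any particular co-activated neighbor.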

Subword-splitting and high dropout parameters helped us achieve better than expected results with a small dataset, but it was the Transformer model that pushed our BLEU scores into the double digits.

Compared to recurrent neural networks, the self-attention layers in the Transformer model more easily learn the dependencies between words in a sequence because the self-attention layers are less complex.

Recurrent networks read words sequentially and employ a gating mechanism to identify relationships between separated words in a sequence. By contrast, self-attention examines the links between all the words in the paired sequences and directly models those relationships. It's a simpler approach.
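The scaled dot-product attention at the heart of the Transformer (Vaswani et al., 2017) can be sketched in a few lines of NumPy. The toy sequence length and dimensions below are illustrative; in self-attention, the queries, keys and values are all projections of the same sequence.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # every position scores every other
    weights = softmax(scores, axis=-1)
    return weights @ V, weights

# a toy sequence of 4 token vectors of dimension 8
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out, weights = scaled_dot_product_attention(x, x, x)
print(weights.round(2))  # each row is one word's distribution over all words
```

Each row of the weight matrix links one word directly to every word in the sequence in a single step, with no gating mechanism and no sequential reading.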

Combining these three features – subword-splitting, dropout and self-attention – yields a trained model that makes relatively good predictions. And as we add more parallel text to our dataset, the translation quality will improve even more.

In the meantime, we invite you to see the results at Napizia.

Copyright © 2002-2020 Eryk Wdowiak