So far, we have limited our dataset to Sicilian-English parallel text. But there's no reason to do so. With a small modification, we can train a single model to translate between multiple languages, including some for which there is little or no parallel text.
For example, if we did not have any Sicilian-Italian parallel text at all, we could still develop a model that translates between Sicilian and Italian ("zero shot" translation) by adding Italian-English parallel text to our dataset.
And if we have some Sicilian-Italian parallel text, then it's even possible to achieve high translation quality between Sicilian and Italian.
The small modification is to add a directional token to the beginning of the source sequence. Johnson et al. (2016) show that this single addition enables multilingual translation in an otherwise conventional model.
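As a minimal sketch of that modification, the function below prepends a token naming the target language to each source sentence. The `<2xx>` token format, the language codes and the example sentences are assumptions made for illustration; any token reserved in the vocabulary would serve the same purpose.

```python
from typing import List, Tuple


def add_direction_token(source: str, target_lang: str) -> str:
    """Prepend a token that names the language to translate into."""
    return f"<2{target_lang}> {source}"


# Illustrative (source, target, target language) triples. The sentences are
# placeholders, not drawn from any corpus.
examples: List[Tuple[str, str, str]] = [
    ("Bon jornu", "Good day", "en"),
    ("Good day", "Buongiorno", "it"),
    ("Buongiorno", "Good day", "en"),
]

# The same model sees every pair; only the leading token tells it the direction.
tagged = [(add_direction_token(src, lang), tgt) for src, tgt, lang in examples]

for src, tgt in tagged:
    print(src, "->", tgt)
```

Because the direction is carried by the data rather than by the architecture, requesting a direction that never appeared in training (say, Italian to Sicilian) only requires changing the token.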
It's an example of transfer learning. In our case, as the model learns to translate from Italian to English, it would also learn to translate from Sicilian to English. And as the model learns to translate from English to Italian, it would also learn how to translate from English to Sicilian.
More parallel text is available for some languages than for others, however, so Johnson et al. also studied the effect of that imbalance on translation quality. They found that oversampling low-resource language pairs improves their translation quality, but at the expense of quality among high-resource pairs.
Importantly, however, that comparison with bilingual translators holds the number of model parameters constant. Arivazhagan et al. (2019) show that training a larger model can improve translation quality across the board.
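One common way to oversample low-resource pairs is temperature-based sampling, in which each pair is drawn with probability proportional to its share of the data raised to the power 1/T. The sketch below assumes that scheme, which is not necessarily the exact procedure Johnson et al. used, and the corpus sizes are hypothetical.

```python
from typing import Dict


def sampling_probs(sizes: Dict[str, int], temperature: float = 1.0) -> Dict[str, float]:
    """Return the probability of drawing a sentence pair from each corpus.

    temperature = 1.0 samples in proportion to corpus size;
    larger temperatures flatten the distribution toward uniform,
    which oversamples the smaller corpora.
    """
    total = sum(sizes.values())
    weights = {pair: (n / total) ** (1.0 / temperature) for pair, n in sizes.items()}
    norm = sum(weights.values())
    return {pair: w / norm for pair, w in weights.items()}


# Hypothetical corpus sizes (number of sentence pairs), for illustration only.
corpus_sizes = {"it-en": 1_000_000, "scn-en": 20_000}

proportional = sampling_probs(corpus_sizes, temperature=1.0)
oversampled = sampling_probs(corpus_sizes, temperature=5.0)

for pair in corpus_sizes:
    print(f"{pair}: {proportional[pair]:.3f} -> {oversampled[pair]:.3f}")
```

Raising the temperature gives the Sicilian-English pairs a larger share of each batch, and the Italian-English pairs a correspondingly smaller one, which mirrors the trade-off described above.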
More recently, Fan et al. (2020) developed strategies for collecting data and for training a model that can translate directly between 100 languages. Previous efforts had resulted in poor translation quality in non-English directions because the data consisted entirely of translations to and from English.
To overcome the limitations of English-centric data, Fan et al. strategically selected the language pairs to mine data for, based on geography and linguistic similarity. Training a model on this more multilingual dataset yielded very large improvements in translation quality in non-English directions, while matching translation quality in English directions.
Given such potential to expand the directions in which languages can be translated and to improve the quality with which they can be translated, an important question is what the model learns. Does it learn to represent similar sentences in similar ways regardless of language? Or does it represent similar languages in similar ways?
Johnson et al. examined two trained trilingual models. In one, they observed similar representations of translated sentences, while in the second they noticed that the representations of zero-shot translations were very different.
Kudugunta et al. (2019) examined the question in a model trained on 103 languages. They found that the representations depend on both the source and target languages, and that the encoder learns a representation in which linguistically similar languages cluster together.
In other words, because the model learns similar representations for similar languages, our model would learn Sicilian-English better from Italian-English data than from Polish-English data. And other Romance languages, like Spanish, would also be good languages to consider.
We can collect some of that parallel text from the resources at OPUS, an open repository of parallel corpora. Because it contains so many language resources, Zhang et al. (2020) recently used it to develop the OPUS-100 corpus, an open-source collection of English-centric parallel text for 100 languages.
Because it's a "rough and ready" massively multilingual dataset, it highlights some of the challenges facing massively multilingual translation. In particular, Zhang et al. show that a model trained with a vanilla setup exhibits off-target translation issues in zero-shot directions. In the English-centric case, that means the model often translates into the wrong language when not translating to or from English.
Zhang et al. tackle this challenge by simulating the missing translation directions. They first observe that Sennrich, Haddow and Birch's (2015) method of back-translation "converts the zero-shot problem into a zero-resource problem" because it creates synthetic source language text. They then observe that this synthetic source language text simulates the missing translation directions.
The only obstacle is scalability. In a massively multilingual context, there are thousands of translation directions, which would require prohibitively many back-translations. To overcome this obstacle, Zhang et al. incorporate back-translation directly into the training process. And their final models exhibit improved translation quality and fewer off-target translation errors.
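To make the idea concrete, here is a rough sketch of back-translating inside the training loop. It is a simplification under stated assumptions, not Zhang et al.'s implementation: the `translate` callable stands in for the model currently being trained, and the function name, the `<2xx>` token format and the language codes are all hypothetical.

```python
import random
from typing import Callable, List, Tuple

# (source text, target text, target language code)
Example = Tuple[str, str, str]


def simulate_directions(
    batch: List[Example],
    languages: List[str],
    translate: Callable[[str, str], str],
    rng: random.Random,
) -> List[Example]:
    """Create synthetic examples for directions missing from the data.

    For each real target sentence, pick a random other language, translate
    the target into it with the current model, and use that synthetic text
    as the source. The target side always stays real, human-written text.
    """
    synthetic = []
    for _source, target, tgt_lang in batch:
        pivot = rng.choice([lang for lang in languages if lang != tgt_lang])
        synthetic_source = translate(target, pivot)  # back-translation step
        synthetic.append((f"<2{tgt_lang}> {synthetic_source}", target, tgt_lang))
    return synthetic


def stub_translate(text: str, lang: str) -> str:
    """Placeholder for the model being trained; it only marks the text."""
    return f"[{lang}] {text}"


batch = [("<2scn> Good day", "Bon jornu", "scn")]
extra = simulate_directions(batch, ["en", "it", "scn"], stub_translate, random.Random(0))
print(extra)
```

Each training step would then mix the real batch with its synthetic counterpart, so no separate pass over thousands of directions is needed.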
So we're excited about the potential for multilingual translation to improve translation quality and to create new translation directions for the Sicilian language. And we still have several issues of Arba Sicula from which to assemble Sicilian-English parallel text, so we also need to finish collecting that parallel text. We'll keep working on it.
Copyright © 2002-2021 Eryk Wdowiak