Developing a Parallel Corpus
When we first set out to create a machine translator for the Sicilian language, we thought that we would have to take a rule-based approach, like the Sicilian-Spanish translator that Uliana Sentsova created for Apertium during the 2016 GSoC.
We thought that the limited number of parallel texts, the diversity of the Sicilian language and the diverse ways that the Sicilian language has been written would make it impossible to use statistical methods to create a machine translator.
Just a few years ago, Koehn and Knowles (2017) calculated learning curves for English-to-Spanish translation. At 377,000 words, the BLEU scores were 1.6 for neural machine translation, 16.4 for statistical machine translation and 21.8 for statistical with a big language model.
But recent advances in the field of neural machine translation have enabled us to obtain better scores with half the number of words.
As described on the low-resource NMT page, the self-attentional Transformer model developed by Vaswani et al. (2017) has replaced recurrent neural networks as the state-of-the-art approach. It improves translation quality in low-resource settings because it directly models the relationships between all the words in a sequence.
More recently, Sennrich and Zhang (2019) found that using the method of subword splitting to reduce the vocabulary to 2000 subwords also delivers large improvements in translation quality in low-resource experiments. And they obtained further improvements by setting high dropout parameters.
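To make the idea of subword splitting concrete, here is a minimal sketch of byte-pair encoding, one common way to learn a subword vocabulary. It is a toy illustration, not the tooling used for this project: the function names (`learn_bpe`, `merge_pair`, `apply_bpe`) and the `</w>` end-of-word marker are choices made for this example. In practice, the number of merge operations would be chosen so that the vocabulary comes out at roughly 2000 subwords.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(sentences, num_merges):
    """Learn up to `num_merges` merge operations from whitespace-tokenized sentences."""
    # Each word starts as a tuple of characters plus an end-of-word marker.
    word_freqs = Counter()
    for sentence in sentences:
        for word in sentence.split():
            word_freqs[tuple(word) + ("</w>",)] += 1

    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs, weighted by word frequency.
        pair_freqs = Counter()
        for symbols, freq in word_freqs.items():
            for pair in zip(symbols, symbols[1:]):
                pair_freqs[pair] += freq
        if not pair_freqs:
            break  # every word is already a single symbol

        # Merge the most frequent pair everywhere it occurs.
        best = max(pair_freqs, key=pair_freqs.get)
        merges.append(best)
        merged_freqs = Counter()
        for symbols, freq in word_freqs.items():
            merged_freqs[merge_pair(symbols, best)] += freq
        word_freqs = merged_freqs
    return merges


def apply_bpe(word, merges):
    """Split a single word into subwords by replaying the learned merges in order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


if __name__ == "__main__":
    corpus = ["lu cani", "lu gattu", "lu suli"]  # toy corpus for illustration
    merges = learn_bpe(corpus, num_merges=10)
    print(apply_bpe("lu", merges))
    print(apply_bpe("luna", merges))  # an unseen word still splits into known pieces
```

Frequent words end up as single units, while rare or unseen words fall back to smaller pieces. That keeps the vocabulary small without losing any text, which matters when the dataset is small.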
Our efforts to develop a translator for the Sicilian language build on their success. But first we need to collect some text.
Repositories of open-source parallel text, like OPUS, do not have any Sicilian language resources. There are no government documents, Wikipedia articles or movie subtitles that we can use as a source of parallel text. But good resources can be found elsewhere.
To seed this project, Arthur Dieli kindly provided 34 translations of Giuseppe Pitrè's Sicilian Folk Tales and lots of encouragement. And Arba Sicula, which has been translating Sicilian literature into English for over 40 years, contributed its bilingual journal of Sicilian history, language, literature, art, folklore and cuisine.
Just as importantly, Arba Sicula developed a standard form of the language, providing the consistency we need in a sea of orthographic and dialectal diversity. Its editor and director, Gaetano Cipolla, has also helped me understand the language, answered countless questions and offered great assistance.
Most of our dataset comes from Arba Sicula articles. Some comes from Dr. Dieli's translations of Pitrè's Folk Tales. And some comes from translations of the homework exercises in the Mparamu lu sicilianu (Cipolla, 2013) and Introduction to Sicilian Grammar (Bonner, 2001) textbooks.
Although it only makes up a small portion of the dataset, adding the textbook examples yielded large improvements in translation quality on a test set drawn only from Arba Sicula articles. I still have to do more experiments to confirm my findings, but it seems to me that if a grammar book helps a human learn in a systematic way, then a grammar book should also help a machine learn in a systematic way. At least that's what seems to be happening here.
Another (ironic) source of parallel text is monolingual text.
Efforts to create neural machine translators for other low-resource languages often involve the back-translation method developed by Sennrich, Haddow and Birch (2015), in which monolingual, target-side text is used to supplement the available parallel text.
We may make more use of this method in the future. So far we have not used it much because assembling Sicilian monolingual text requires almost as much time as assembling parallel text. We do, however, have some leftover unmatched text, which we can use for back-translation.
For example, to develop our English-to-Sicilian model, we could automatically translate the unmatched Sicilian text into English to create a "synthetic dataset" of real Sicilian sentences and synthetic English sentences. Then we would train a new English-to-Sicilian model on the combination of the parallel and synthetic data.
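The data flow just described can be sketched in a few lines. This is only an outline under stated assumptions: `translate_sc_to_en` is a hypothetical stand-in for whatever Sicilian-to-English model is available, and the function names are invented for this example. The key point is that the real Sicilian sentence stays on the target side, and the machine-made English goes on the source side.

```python
from typing import Callable, Iterable, List, Tuple

# (English source, Sicilian target) for an English-to-Sicilian model
Pair = Tuple[str, str]


def build_backtranslated_pairs(
    sicilian_sentences: Iterable[str],
    translate_sc_to_en: Callable[[str], str],  # hypothetical: any Sicilian-to-English translator
) -> List[Pair]:
    """Pair each real Sicilian sentence with a synthetic English translation."""
    pairs: List[Pair] = []
    for sicilian in sicilian_sentences:
        sicilian = sicilian.strip()
        if not sicilian:
            continue  # skip blank lines in the unmatched text
        english = translate_sc_to_en(sicilian).strip()
        if english:
            # Synthetic English is the source; real Sicilian is the target.
            pairs.append((english, sicilian))
    return pairs


def combine_training_data(parallel: Iterable[Pair], synthetic: Iterable[Pair]) -> List[Pair]:
    """Concatenate human-translated pairs with back-translated pairs."""
    return list(parallel) + list(synthetic)


if __name__ == "__main__":
    # Placeholder translator for demonstration only; a real run would call a trained model.
    def fake_translate(sentence: str) -> str:
        return "EN(" + sentence + ")"

    parallel = [("the dog", "lu cani")]
    unmatched = ["lu suli", "", "lu gattu"]
    synthetic = build_backtranslated_pairs(unmatched, fake_translate)
    for english, sicilian in combine_training_data(parallel, synthetic):
        print(english, "|||", sicilian)
```

The new English-to-Sicilian model would then be trained on the combined list, so the model always learns to produce Sicilian that a person actually wrote.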
And in general, you can always find ways to assemble more parallel text.
Copyright © 2002-2022 Eryk Wdowiak