Natural Language

Machine Translation

When we first set out to create a machine translator for the Sicilian language, we thought that we would have to take a rule-based approach, like the Sicilian-Spanish translator that Uliana Sentsova created for Apertium during the 2016 Google Summer of Code (GSoC).

We thought that the limited number of parallel texts, the diversity of the Sicilian language and the many ways in which it has been written would make it impossible to use statistical methods to create a machine translator.

We may have been wrong. Luong and Manning (2015) successfully created an English-Vietnamese translator with only 133,000 sentence pairs, using Neural Machine Translation (NMT), so this page explores the possibility of using that model to create an English-Sicilian translator.

Pitrè's Folk Tales

To seed the project, Arthur Dieli kindly provided 34 translations of Giuseppe Pitrè's Sicilian Folk Tales and lots of encouragement. We also received good advice and support from Gaetano Cipolla of Arba Sicula.

In our first effort, we aligned whole sentences. To create the largest dataset possible, we also included translations of Sicilian poetry from Dr. Dieli's website.

That effort showed us that the model can work, but the quality was poor. The trained model could not identify translations of some of the most common words in the vocabulary.

At Dr. Cipolla's suggestion, we then helped the model find the translations by breaking long sentences into shorter sentences and clauses. And to keep the language relatively homogeneous, we only included the Folk Tales.
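The idea of breaking long sentences into shorter sentences and clauses can be sketched in a few lines of Python. This is only an illustration, not the procedure we actually used: the helper names `split_clauses` and `align_pair` are hypothetical, and the sketch assumes that punctuation marks the clause boundaries and that a sentence pair can be split safely only when both sides yield the same number of segments. Real parallel text usually needs manual review on top of this.

```python
import re

# Assumption: a clause ends at one of these punctuation marks followed by whitespace.
_CLAUSE_BREAK = re.compile(r"(?<=[.!?;:,])\s+")


def split_clauses(sentence):
    """Split a sentence into shorter segments, keeping punctuation attached."""
    return [seg.strip() for seg in _CLAUSE_BREAK.split(sentence) if seg.strip()]


def align_pair(english, sicilian):
    """Pair up clauses when both sides split into the same number of segments.

    Otherwise keep the whole-sentence pair, so that no misaligned
    fragments enter the dataset.
    """
    en_parts = split_clauses(english)
    sc_parts = split_clauses(sicilian)
    if len(en_parts) == len(sc_parts):
        return list(zip(en_parts, sc_parts))
    return [(english, sicilian)]


# Placeholder strings stand in for real sentence pairs.
for pair in align_pair("first part, second part.", "prima parti, secunna parti."):
    print(pair)
```

Falling back to the whole sentence when the segment counts differ is a conservative choice: it keeps fewer short pairs, but it avoids teaching the model wrong alignments.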

That shrank the dataset to about half its previous size, but gave us a much better result. The second translator that we created was able to correctly translate phrases that are well-represented in the dataset.

This result suggested that, if all the sentences in the dataset are in standard Sicilian, it should be possible to create a neural machine translator for the Sicilian language. So in our third effort, we standardized the Sicilian text and used subword splitting, which yielded a large improvement in translation quality.

In addition to Pitrè's Folk Tales, the dataset used in the third effort also includes Dr. Dieli's translations of Sicilian proverbs and one issue of Arba Sicula. It contains only 40,000 English words and 37,000 Sicilian words, so we still have to assemble more parallel text. But the results were very encouraging, so we posted the model at Napizia.


Below are a few translations from our most recent model. The dataset is still small, so the resulting translator is not very useful, but it does correctly translate some phrases that appear frequently in the dataset:

>>> top_trans('the neapolitan and the sicilian', nu_trans=1)

the neapolitan and the sicilian

lu napulitanu e lu sicilianu

>>> top_trans('the large hat pays for all .', nu_trans=1)

the large hat pays for all .

cappiddazzu paga tuttu.

>>> top_trans('it was the scissors !', nu_trans=1)

it was the scissors !

forfici foru!

>>> top_trans('car@@ in@@ isi are dogs !', nu_trans=1)

car@@ in@@ isi are dogs !

cani carinisi!

Note that the last translation ("Carinisi are dogs!") shows the subword splitting method described on the next page.
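To read a subword-split sentence as ordinary text, the pieces have to be joined back together. The sketch below assumes that a trailing `@@` marks a piece that continues into the next token, as the input in the last example suggests. The function name `join_subwords` is hypothetical and is not part of our translator.

```python
import re


def join_subwords(text, marker="@@"):
    """Remove continuation markers so that split pieces form whole words again."""
    # Assumption: the marker is followed either by a single space or by end of string.
    pattern = re.escape(marker) + r"( |$)"
    return re.sub(pattern, "", text)


print(join_subwords("car@@ in@@ isi are dogs !"))
# prints: carinisi are dogs !
```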

Copyright © 2002-2019 Eryk Wdowiak