Data Science needs both Data and Science

13 December 2020

A friend asked me to help him understand a recording of his grandfather. It's in Sicilian, but not standard Sicilian. His grandfather spoke in his native dialect.

And because it's not in standard Sicilian, my friend was skeptical that I could help him. It's always good to be a little skeptical, but I don't like being unable to do something, so I want to figure it out.

How should we transcribe his grandfather's speech? Once we do, we can translate it with relative ease, but first we have to transcribe it.

I tell this story because it's the story of how human beings record information. We need a science to interpret data. And we also need a science to collect data.

Recent advances in the fields of deep learning and natural language processing enabled us to create a machine translator for the Sicilian language. That's the science that interprets data. But before my friend can take advantage of it, we also need a science to collect the data in the recording of his grandfather.

That science started several thousand years ago when humans began recording their words in cuneiform and hieroglyphs. The use of symbols to represent sounds (i.e. an alphabet) is a relatively recent innovation.

We'll use an alphabetic script to represent his grandfather's speech. But first, we have to make a choice. Should that representation be a phonetic transcription of what his grandfather said? Or should it be a standard Sicilian transcription of what his grandfather said?

For comparison, imagine that we're transcribing the speech of someone from Massachusetts. They might talk about "Hahvahd," but when we transcribe their speech we usually use the letter R and write: "Harvard," unless (of course) we're trying to represent the peculiar way they speak.

My friend has told me that he wants to learn how to speak like his grandfather, so we'll certainly prepare the phonetic transcription for him, but in most cases one would prefer the standard transcription.

And so that my friend (and all Sicilians) can see the similarities between the different dialects of Sicilian, we should create comparison tables. That way my friend (and other learners) can use standard Sicilian to learn the dialect of their grandparents.
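A comparison table can be as simple as a lookup from phonetic spellings to standard ones. The sketch below is only an illustration under stated assumptions: it borrows the Massachusetts example instead of real Sicilian data, the entries other than "Hahvahd" are made up for the demonstration, and the names `COMPARISON_TABLE` and `standardize` are hypothetical.

```python
import re

# Hypothetical comparison table: phonetic spelling -> standard spelling.
# Only "Hahvahd" comes from the example in the text; the other rows are
# illustrative assumptions.
COMPARISON_TABLE = {
    "hahvahd": "Harvard",
    "yahd": "yard",
    "cah": "car",
}


def standardize(phonetic_text, table=COMPARISON_TABLE):
    """Replace every word found in the table with its standard spelling."""

    def swap(match):
        word = match.group(0)
        # Words missing from the table are left exactly as transcribed.
        return table.get(word.lower(), word)

    # \w+ also matches accented letters, which a Sicilian table would need.
    return re.sub(r"\w+", swap, phonetic_text)


print(standardize("I pahked the cah in Hahvahd yahd"))
# prints: I pahked the car in Harvard yard
```

Notice that "pahked" passes through unchanged because the table has no row for it. A table like this is only as good as the data collected for it, so each dialect needs its own rows.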

Phonetic transcriptions are also great for poetry, where the sound – the rhythm and rhyme – is important. But in most cases one would prefer the consistency that standard transcription provides.

Consistency makes language easier for us to read. Today, in the 21st century, consistency also makes it easier for computers to understand what my friend's grandfather said.

And, when supplemented by comparison tables, consistency may help my friend learn to speak like his grandfather.
