Natural Language


These pages provide two examples of how machine learning is used in natural language processing: modeling the context in which words appear, and machine translation.

In the first example, we develop lists of contextually similar words for the Sicilian language dictionary at Napizia, because any study of a natural language should begin with a good dictionary.

But these days most projects do not begin with a dictionary. Instead they drop raw text into a neural network and let the machine "learn" the meaning of words and phrases from context.

And that method often yields good results! All day long, the computers at major financial institutions are reading information in news reports and making profitable buy/sell decisions instantly. So we want to know what they're doing. And we want to know how we can improve upon their good results.

better information, better results

To know what they're doing, these pages take notes on machine learning and natural language processing. To improve upon their work, we will feed better information into those computer algorithms.

On the next page, we take notes on word context and word embedding models. And on the following page, we train one of those models on the pages and articles in Sicilian Wikipedia. Word embeddings form one part of neural machine translation, so our third page considers ways that we might create such a translator for the Sicilian language.

To build a better dictionary, we will improve upon their work by recognizing that some words (like pronouns, prepositions, articles and conjunctions) appear in just about every context, so they convey almost no information about context. By contrast, nouns, verbs, adjectives and adverbs convey a lot of information about context because they vary by context.

If our goal is to identify context, then we should remove context-free words from each sentence before training a word embedding model for our dictionary. We're going to train the model by having it read text through a window of five words at a time, so we want the five words in that window to carry as much context as possible.
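
As a rough sketch of that filtering step, the Python below drops context-free words from a tokenized sentence and then cuts the result into five-word windows. The word list `CONTEXT_FREE`, the example sentence and the function names are our own illustrative assumptions, not the list or the code used for the dictionary.

```python
# Illustrative assumption: a tiny hand-picked set of Sicilian function words
# (articles, prepositions, conjunctions). A real list would be much longer.
CONTEXT_FREE = {"lu", "la", "li", "un", "na", "di", "a", "cu", "pi", "e", "ma"}

def remove_context_free(tokens, stopwords=CONTEXT_FREE):
    """Keep only the tokens that are not in the context-free word list."""
    return [t for t in tokens if t.lower() not in stopwords]

def windows(tokens, size=5):
    """Return every run of `size` consecutive tokens (or one short window)."""
    if len(tokens) <= size:
        return [tokens] if tokens else []
    return [tokens[i:i + size] for i in range(len(tokens) - size + 1)]

# Hypothetical example sentence, split on whitespace.
sentence = "lu cani di lu vicinu mancia lu pani e la pasta".split()
content = remove_context_free(sentence)

print(windows(sentence)[0])  # first window of the raw sentence
print(windows(content)[0])   # first window after filtering
```

In this toy sentence, the first raw window holds only two content words, while the first filtered window holds five, which is the effect we are after.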

And if our goal is to identify context, then we should also use our dictionary to convert each word to its lemma (e.g. the singular form of nouns, the infinitive form of verbs). That will reduce the total vocabulary, but increase the count of each word in that reduced vocabulary, thus increasing its information value.
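
To see the effect on vocabulary size and word counts, here is a minimal sketch that maps word forms to lemmas with a lookup table and compares the counts before and after. The `LEMMAS` table is a hand-made assumption with a few illustrative forms. In practice the lookup would come from the dictionary itself.

```python
from collections import Counter

# Illustrative assumption: a few inflected forms mapped to their lemmas.
LEMMAS = {
    "manciu": "manciari",
    "mancia": "manciari",
    "mancianu": "manciari",
    "casi": "casa",
}

def lemmatize(tokens, lemmas=LEMMAS):
    """Replace each token by its lemma, leaving unknown tokens unchanged."""
    return [lemmas.get(t, t) for t in tokens]

tokens = ["manciu", "mancia", "mancianu", "casa", "casi"]
raw_counts = Counter(tokens)
lemma_counts = Counter(lemmatize(tokens))

print(len(raw_counts), dict(raw_counts))      # vocabulary size and counts before
print(len(lemma_counts), dict(lemma_counts))  # vocabulary size and counts after
```

Five distinct forms that each appear once collapse into two lemmas that appear three and two times, so the model sees each remaining word more often.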

So if we train our models on better information, our models will return better results. We will have a better Sicilian dictionary and we might make some money on Wall Street too.

Copyright © 2002-2020 Eryk Wdowiak