Given a particular word, we can imagine different contexts in which the word might appear. For example, given the word "apple," the context might be a discussion of "fruit." And in reverse, if the context were "fruit," we might expect to see the word "apple."
Words may appear in many different contexts, however, so someone trying to classify words by context would first have to develop a set of contexts, then choose the contexts for each word and decide how much weight to give each one.
But we're not writing a thesaurus. A thesaurus groups words together by similarity of meaning. The task here is different. Here, our models look for similarity of context.
word embedding models
Word embedding models use large samples of written text to identify context. They produce a vector space in which the location of a word's vector represents its context, so that two words that tend to appear in similar contexts are represented by two vectors that lie close together. Conversely, two words that appear in different contexts should be represented by two vectors that are nearly orthogonal to each other.
GloVe models start by counting the co-occurrences of each pair of words, then minimize a weighted least-squares loss in which the dot product of two word vectors approximates the logarithm of the pair's co-occurrence count.
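To make the counting step concrete, here is a minimal pure-Python sketch that tallies co-occurrences within a symmetric context window. (The function name and the toy sentence are illustrative; the subsequent loss-minimization step of GloVe is omitted.)

```python
from collections import Counter

def cooccurrence_counts(tokens, window=2):
    """Count how often each ordered pair of words appears
    within `window` positions of each other."""
    counts = Counter()
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[(center, tokens[j])] += 1
    return counts

counts = cooccurrence_counts("the apple is a fruit".split(), window=2)
# e.g. counts[("apple", "is")] == 1
```

A GloVe model would then fit word vectors so that their dot products approximate the logarithms of these counts.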
By contrast, word2vec models do not use counts at all. Instead they train a neural network to make predictions. Given a window of surrounding words (i.e. context), the continuous bag of words (CBOW) model predicts what word might appear. The continuous skip-gram model makes the opposite prediction: given a word, it predicts what context words might appear.
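The training examples for both models come from sliding a window over the text. The sketch below (function name and sentence are illustrative) pairs each center word with its surrounding context; CBOW predicts the center word from the context, while skip-gram predicts the context words from the center word.

```python
def training_pairs(tokens, window=2):
    """Pair each center word with the words in its context window.
    CBOW treats the pair as (input=context, target=center);
    skip-gram treats it as (input=center, target=context)."""
    pairs = []
    for i, center in enumerate(tokens):
        context = [tokens[j]
                   for j in range(max(0, i - window),
                                  min(len(tokens), i + window + 1))
                   if j != i]
        pairs.append((context, center))
    return pairs

pairs = training_pairs("we saw an apple on the tree".split(), window=2)
# pairs[3] == (['saw', 'an', 'on', 'the'], 'apple')
```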
FastText models extend the logic of word2vec from words to characters. Instead of treating the word as the basic unit, they represent each word by the character n-grams it contains. Because every word is composed of characters, fastText models can generate word vectors even for words that did not appear in the training corpus.
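The subword decomposition can be sketched as follows: fastText marks word boundaries with angle brackets and extracts character n-grams, then represents a word as the sum of its n-gram vectors. (This sketch only shows the n-gram extraction; the n-gram range is a configurable parameter.)

```python
def char_ngrams(word, n_min=3, n_max=4):
    """Extract character n-grams of a word, with '<' and '>'
    marking the word boundaries, as fastText does."""
    marked = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(marked) - n + 1):
            grams.append(marked[i:i + n])
    return grams

grams = char_ngrams("apple")
# includes boundary-aware n-grams such as '<ap' and 'ple>'
```

An out-of-vocabulary word still shares n-grams with known words, which is what makes its vector computable.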
What the models have in common is that they produce an embedding. After training a model, we can retrieve its word vectors and use cosine similarity to measure how close a pair of vectors lies. That similarity tells us how alike the two words are in context. By selecting the words with the highest similarity, we can prepare a list of related words for inclusion in a dictionary.
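Cosine similarity is just the cosine of the angle between two vectors: close to 1.0 for vectors pointing in the same direction, close to 0.0 for nearly orthogonal ones. A minimal implementation:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors:
    near 1.0 for similar directions, near 0.0 for orthogonal ones."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

cosine_similarity([1.0, 0.0], [0.0, 1.0])  # orthogonal vectors -> 0.0
```

Ranking all other words by this measure against a query word yields the list of nearest neighbors in context.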
The next page develops a word embedding model from the text of Sicilian Wikipedia articles and uses it to measure the cosine similarity between a pair of words. It illustrates a general method that we can use to measure the cosine similarity between each pair of words in the Sicilian language. You can use the cosenu di sumigghianza tool at Napizia to view those measures.
Because Sicilian Wikipedia has far fewer articles than other language Wikipedias, there is less data to train on. So I tried to obtain more information from the available data by removing "stop words" — articles, prepositions, pronouns, conjunctions and most punctuation.
Removing those common words increases the number of times that the remaining words co-occur within the same context window, extracting more information from the same dataset.
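The filtering step itself is simple. The sketch below uses a small illustrative stop list of Sicilian function words (not the actual list used for the model):

```python
# an illustrative (not exhaustive) stop list of Sicilian function words
STOP_WORDS = {"lu", "la", "li", "un", "una", "di", "a", "e", "ca", "pi"}

def remove_stop_words(tokens):
    """Drop stop words so that content words land closer together
    and fall inside the same context window."""
    return [t for t in tokens if t not in STOP_WORDS]

remove_stop_words("la terra e lu suli".split())  # -> ['terra', 'suli']
```

After filtering, "terra" and "suli" are adjacent, so they co-occur even under a very narrow window.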
Ideally, we should also replace conjugated verbs with their infinitives, replace plural nouns and adjectives with their singular forms, etc. (i.e. lemmatization), but such an ambitious task will have to wait for another day.
Obtaining better information from available data is particularly important for languages like Sicilian that have few digital texts, but many passionate speakers. As discussed on the next page, there were only 39,000 unique words in the Sicilian Wikipedia articles. For comparison, Mortillaro's (1876) dictionary of the Sicilian language defines over 50,000 words and Nicotra's (1883) dictionary defines about 30,000.
We need better data to build a better dictionary. On these pages, we begin building that better dictionary.
Copyright © 2002-2019 Eryk Wdowiak