Natural Language

word embeddings

On this page, we develop word embeddings for the Sicilian language using the articles on Sicilian Wikipedia. Then we will use those embeddings to measure the cosine similarity between a pair of words.

We can also compute the cosine similarity for every pair of words in the vocabulary, so this page concludes by computing that matrix. You can search it with the cosenu di sumigghianza tool or read a summary of the cosine similarity results at Napizia.

data and training

Our first task is to download the scnwiki-20190201-pages-articles.xml.bz2 file from Wikimedia Downloads.

Then, we modify a Perl script by Matt Mahoney to remove markup and other text that is not human language. The script outputs a single line of text in which the words are separated by tabs.

The script also removes a list of stop words — articles, prepositions, pronouns, conjunctions and most punctuation — so that the remaining words co-appear in the context window more often.
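The cleaning step can be sketched in Python. This is only an illustration, not the modified Perl script itself: the stop-word list below is a hypothetical, abbreviated one, and the function name clean_line is our own. The sketch keeps the period because the period still appears in the vocabulary later on this page.

```python
import re

# Hypothetical, abbreviated stop-word list (assumption: the real list is longer).
STOP_WORDS = {"lu", "la", "li", "di", "a", "e", "cu", "un", "na"}

def clean_line(text, stop_words=STOP_WORDS):
    """Lowercase the text, keep words and periods, drop stop words, join with tabs."""
    # runs of letters are words; a period is kept as its own token
    tokens = re.findall(r"[^\W\d_]+|\.", text.lower())
    return "\t".join(tok for tok in tokens if tok not in stop_words)

cleaned = clean_line("La pruvincia e lu cumuni.")
print(cleaned.split("\t"))
```

After cleaning, "pruvincia" and "cumuni" sit next to each other, which is the effect that removing stop words is meant to achieve.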

To train the model, we use the train_sg_cbow.py script provided by MXNet. We modify the script's train function so that it calls nlp.data.TSVDataset(args.data) to read our tab-delimited Wikipedia file. Then, from a shell, we run:

python3 train_sg_cbow.py --model skipgram --ngram-buckets 0 --data scnwiki-20190201.tsv

which trains the model and saves the model parameters to a directory called logs.
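To see why the context window matters, here is a minimal sketch of how a skip-gram model pairs each word with its neighbors. It assumes a fixed window size and leaves out the negative sampling that the training script performs, so it illustrates the idea rather than the script's actual batching code. The function name skipgram_pairs is our own.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for all tokens within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

with_stops = ["vinniri", "la", "casa"]   # stop word "la" still present
without_stops = ["vinniri", "casa"]      # stop word removed

print(("vinniri", "casa") in skipgram_pairs(with_stops, window=1))
print(("vinniri", "casa") in skipgram_pairs(without_stops, window=1))
```

With a window of one, "vinniri" and "casa" only become a training pair once the stop word between them is removed.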

retrieve the vectors

Having estimated the model, we can now retrieve the word vectors. To start, we load the Wikipedia data again:

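import gluonnlp as nlp

# preprocess_dataset is assumed to be the helper that accompanies
# train_sg_cbow.py; import it from your copy of the MXNet scripts
datafile = 'scnwiki-20190201.tsv'

data = nlp.data.TSVDataset(datafile)

data, vocab, idx_to_counts = preprocess_dataset(data)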

And let's take a moment to explore it. Out of the 2,184,842 total words in the corpus, there are 39,010 unique words:

>>> sum(idx_to_counts)

2184842

>>> len(vocab)

39010

Those are small numbers. It's not uncommon to run regressions with over 2.2 million observations, yet here we have fewer than 2.2 million total words. The dictionary on your shelf might have 50,000 definitions, yet here we only have 39,010 unique words.

The ten most frequently occurring words are:

>>> vocab.idx_to_token[:10]

['<BOS>', '<EOS>', '.', 'è', 'catigurìa', 'pruvincia', 'cumuni', 'comu', 'nun', 'fu']

The beginning-of-sequence and end-of-sequence markers occur most frequently, followed by the period and the verb "è" (which means "is"). Sicilian Wikipedia's category marker ("catigurìa") is the next most frequent, followed by its markers for provinces ("pruvincia") and municipalities ("cumuni"). At the end of the top ten list are the words: "comu" ("like"), "nun" ("not") and "fu" ("was").

Those ten tokens account for about 28.5 percent of the total number of words in the corpus:

>>> round( sum(idx_to_counts[:10]) / sum(idx_to_counts), 4 )

0.2852
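The same bookkeeping can be reproduced on a toy corpus with the standard library. This sketch assumes, as the output above suggests, that the vocabulary index is ordered by decreasing frequency. The toy sentence and the tie-breaking rule (alphabetical) are our own choices.

```python
from collections import Counter

# toy corpus (assumption: seven tokens, for illustration only)
tokens = "cumuni pruvincia cumuni fu cumuni pruvincia nun".split()
counter = Counter(tokens)

# order the vocabulary by decreasing count, breaking ties alphabetically
idx_to_token = [tok for tok, _ in
                sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))]
idx_to_counts = [counter[tok] for tok in idx_to_token]

# share of the corpus covered by the two most frequent tokens
top2_share = round(sum(idx_to_counts[:2]) / sum(idx_to_counts), 4)
print(idx_to_token, idx_to_counts, top2_share)
```

Here the two most frequent tokens cover five of the seven words, just as the top ten tokens cover a large share of the Sicilian corpus.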

Our next step is to load the model and its trained parameters:

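import mxnet as mx

# model, output_dim, batch_size and num_negatives must match the training run;
# parmfile is the path to the parameter file saved in the logs directory
embedding = model(token_to_idx=vocab.token_to_idx, output_dim=output_dim,
    batch_size=batch_size, num_negatives=num_negatives,
    negatives_weights=mx.nd.array(idx_to_counts))

embedding.load_parameters(parmfile)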

Then we retrieve the word vectors:

wvecs = embedding.embedding_out.weight.data()

which gives us a representation of context for each word in the vocabulary.

cosine similarity

With the word vectors in hand, we can now see which words appear in similar contexts. Two vectors that point in nearly the same direction have a higher cosine similarity than two vectors that point in very different directions. So let's define a function to compute the cosine measure:

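from mxnet import nd

def cos_sim(wordx, wordy):
    # look up each word's row in the matrix of word vectors
    xx = wvecs[vocab.token_to_idx[wordx],]
    yy = wvecs[vocab.token_to_idx[wordy],]
    # dot product divided by the product of the Euclidean norms
    return nd.dot(xx, yy) / (nd.norm(xx) * nd.norm(yy))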

and let's use it to find the similarity between "vinniri" ("sell") and "prezzu" ("price"):

>>> cos_sim('vinniri', 'prezzu')

[0.50711024]

<NDArray 1 @cpu(0)>

The measure is relatively high because "sell" and "price" commonly appear in the same context. After all, one sells a good or service for a price. For comparison, the similarity between "vinniri" ("sell") and "danza" ("dance") is lower:

>>> cos_sim('vinniri', 'danza')

[0.35387993]

<NDArray 1 @cpu(0)>

The measure of similarity is lower because a dance is not something that one sells.
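To make the measure concrete, here is a small standard-library sketch of the same formula on made-up two-dimensional vectors. The vectors are illustrative only and are not taken from the trained model.

```python
import math

def cosine(x, y):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

same_direction = cosine([1.0, 2.0], [2.0, 4.0])   # parallel vectors
orthogonal = cosine([1.0, 0.0], [0.0, 3.0])       # perpendicular vectors
opposite = cosine([1.0, 0.0], [-2.0, 0.0])        # opposite directions

print(same_direction, orthogonal, opposite)
```

The measure depends only on direction, not on length, and it ranges from -1 (opposite directions) through 0 (perpendicular) to 1 (same direction). Both 0.51 and 0.35 therefore indicate some shared context, with "prezzu" sharing more of it with "vinniri" than "danza" does.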

cosine similarity matrix

The function above returns the cosine measure for a pair of words. To build our dictionary, we need to compute the cosine measure for each pair of words, so we compute the cosine matrix by dividing the matrix of pairwise dot products of the word vectors by the outer product of their Euclidean norms:

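def cos_mat( vecs ):
    # matrix of pairwise dot products; its diagonal holds the squared norms
    xtx = nd.dot( vecs , vecs.T )
    # column vector of Euclidean norms
    nmx = nd.sqrt( nd.diag(xtx) ).reshape((-1,1))
    # outer product of the norms
    cnm = nd.dot( nmx , nmx.T )
    return xtx / cnm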
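The same steps can be checked by hand on three small vectors. The sketch below uses plain Python lists instead of MXNet arrays, and the vectors are made up so that the norms come out as whole numbers.

```python
import math

def cos_mat_lists(vecs):
    """Cosine matrix from the dot-product matrix and the outer product of norms."""
    n = len(vecs)
    # pairwise dot products; the diagonal holds the squared norms
    gram = [[sum(a * b for a, b in zip(u, v)) for v in vecs] for u in vecs]
    norms = [math.sqrt(gram[i][i]) for i in range(n)]
    # divide each dot product by the product of the two norms
    return [[gram[i][j] / (norms[i] * norms[j]) for j in range(n)]
            for i in range(n)]

toy_vecs = [[3.0, 4.0], [4.0, 3.0], [0.0, 2.0]]   # norms 5, 5 and 2
toy_cosmat = cos_mat_lists(toy_vecs)

for row in toy_cosmat:
    print([round(value, 2) for value in row])
```

The result is symmetric with ones on the diagonal, and each off-diagonal entry equals what the pairwise function would return for that pair. For example, the first two vectors have a dot product of 24 and norms of 5 each, giving 24/25 = 0.96.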

Computing that matrix for the entire vocabulary of 39,010 words will consume a lot of computer memory, but these days a laptop computer can handle it. Nonetheless, it may make sense to focus on the words that appear at least 100 times:

>>> idx_to_counts[2602:2604]

[100, 99]

so we will limit the cosine matrix to the 2,603 most commonly occurring words:

import numpy as np

# number of words that occur at least 100 times
slimit = int( (np.array(idx_to_counts) >= 100).sum() )

svecs = wvecs[:slimit,]

cosmat = cos_mat( svecs )
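A quick back-of-the-envelope calculation shows how much the limit saves. The sketch assumes four bytes per value (32-bit floats) and counts a single dense matrix; the cos_mat function also builds two intermediate matrices of the same size, so peak usage would be a multiple of these figures.

```python
def dense_matrix_bytes(n_words, bytes_per_value=4):
    """Bytes needed for one n_words x n_words matrix (4 bytes assumes float32)."""
    return n_words * n_words * bytes_per_value

full_bytes = dense_matrix_bytes(39010)     # entire vocabulary
limited_bytes = dense_matrix_bytes(2603)   # words with at least 100 occurrences

print(f"full:    {full_bytes / 1e9:.2f} GB")
print(f"limited: {limited_bytes / 1e6:.1f} MB")
```

Under those assumptions, one matrix for the full vocabulary takes roughly 6.1 GB, while the limited matrix takes about 27 MB.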

Finally, we save the cosine matrix to a CSV file, so that we can search the measures with the cosenu di sumigghianza tool:

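import csv

# open the file once; newline='' lets the csv module control line endings
with open(otcsv, 'w', newline='') as otfile:
    writer = csv.writer(otfile)
    # header row: an empty corner cell, then the column labels
    writer.writerow([''] + vocab.idx_to_token[:slimit])
    # one row per word: the row label, then its cosine measures
    for rowname, rowdata in zip(vocab.idx_to_token, cosmat.asnumpy()):
        writer.writerow([rowname] + rowdata.tolist())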
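Once the matrix is saved in that layout, looking up the nearest neighbors of a word is straightforward. The sketch below is one possible way to do it, not the code behind the cosenu di sumigghianza tool. It writes a tiny made-up matrix to an in-memory buffer so that it runs on its own. The function names and the similarity values are hypothetical.

```python
import csv
import io

def write_matrix(handle, labels, matrix):
    """Write a labeled square matrix: empty corner cell, header row, labeled rows."""
    writer = csv.writer(handle)
    writer.writerow([""] + labels)
    for label, row in zip(labels, matrix):
        writer.writerow([label] + row)

def read_matrix(handle):
    """Read the matrix back as a dict of dicts: table[row_word][column_word]."""
    reader = csv.reader(handle)
    header = next(reader)[1:]   # skip the empty corner cell
    return {row[0]: dict(zip(header, map(float, row[1:]))) for row in reader}

def most_similar(table, word, k=2):
    """Return the k words with the highest similarity to `word`, excluding itself."""
    others = [(w, s) for w, s in table[word].items() if w != word]
    return sorted(others, key=lambda pair: pair[1], reverse=True)[:k]

# made-up similarity values, for illustration only
labels = ["vinniri", "prezzu", "danza"]
matrix = [[1.0, 0.5, 0.2], [0.5, 1.0, 0.4], [0.2, 0.4, 1.0]]

buffer = io.StringIO(newline="")
write_matrix(buffer, labels, matrix)
buffer.seek(0)
table = read_matrix(buffer)
print(most_similar(table, "vinniri", k=1))
```

To use it on the real output, replace the in-memory buffer with open(otcsv, newline='').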


Copyright © 2002-2019 Eryk Wdowiak