In our discussion of similarity between documents, we assumed that language is just a "bag of words." Specifically, we assumed that grammar is unimportant and that words have no synonyms or antonyms.
Analysis of a term-document matrix retains those assumptions, but its count-based methods help us identify the most frequently used words and the words associated with them.
The resulting frequencies and correlation coefficients do not help us identify synonyms or antonyms, but analysis of a term-document matrix does help us classify documents and understand relationships among words.
Suppose we have four documents with the following text:
With those documents, we can build a term-document matrix (with documents in the rows and words in the columns) by counting the number of times each word appears in each document.
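To make the counting step concrete, here is a minimal sketch in base R. The three example texts, the short stop-word list and all variable names are hypothetical and chosen only for illustration; they are not the documents or the stop-word list used on this page, and the tm package would normally do this work for us.

```r
# Hypothetical example texts (NOT the documents from this page).
docs <- c(doc1 = "the cat chased the mouse",
          doc2 = "my cat likes the garden",
          doc3 = "we planted a garden and the cat watched")

# A tiny illustrative stop-word list (an assumption, not tm's list).
stop_words <- c("the", "and", "my", "we", "a")

# Lower-case each document, split it on whitespace, drop the stop words.
tokens <- lapply(strsplit(tolower(docs), "[[:space:]]+"),
                 function(w) w[!(w %in% stop_words)])
names(tokens) <- names(docs)

# Vocabulary: every remaining word, sorted alphabetically.
vocab <- sort(unique(unlist(tokens)))

# Start from a matrix of zeros with documents in the rows and
# words in the columns, then add one for every word occurrence.
doc_terms <- matrix(0, nrow = length(docs), ncol = length(vocab),
                    dimnames = list(names(docs), vocab))
for (d in names(tokens)) {
  for (w in tokens[[d]]) {
    doc_terms[d, w] <- doc_terms[d, w] + 1
  }
}

print(doc_terms)
```

Each row of `doc_terms` is one document's word counts, which is the layout assumed in the rest of this discussion.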
After stripping out "stop words" (common words such as "the," "and," "my" and "we"), our term-document matrix might be:
From that matrix, we can compute the cosine similarity matrix and measure the similarity in word frequency for each document pair.
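As a rough sketch of that calculation, the cosine similarity of two documents is the dot product of their rows of counts divided by the product of the rows' Euclidean lengths. The small matrix below is hypothetical (it is not the matrix from this page), and the variable names are only illustrative.

```r
# Hypothetical document-by-term counts (NOT the matrix from this page).
doc_terms <- matrix(c(1, 1, 0, 0,
                      1, 0, 1, 0,
                      0, 0, 1, 1),
                    nrow = 3, byrow = TRUE,
                    dimnames = list(c("doc1", "doc2", "doc3"),
                                    c("cat", "chased", "garden", "planted")))

# Euclidean length of each document's row of counts.
row_norms <- sqrt(rowSums(doc_terms^2))

# Dot product of every pair of rows, divided by the product of
# the two rows' lengths.
cos_sim <- (doc_terms %*% t(doc_terms)) / (row_norms %o% row_norms)

print(round(cos_sim, 2))
```

In this toy example, doc1 and doc3 share no words, so their similarity is zero, while each document has a similarity of one with itself.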
With Ingo Feinerer's tm (text mining) package for R, we can also use the term-document matrix to find frequently used terms and the other terms that are correlated with them.
> findFreqTerms(docTerms, lowfreq=3)
 "niece" "swimming" "weekend"
> findAssocs(docTerms, term = "niece" , corlimit = 0.10)
last likes spent went
0.33 0.33 0.33 0.33
Note, however, that findAssocs returns only the words whose correlation with the search term is at least corlimit, so it returns only positively correlated words. Words that are negatively correlated with the search term are omitted, as can be seen from the full correlation matrix below.
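The effect of that cutoff can be illustrated with a minimal base R sketch. It assumes the associations are ordinary Pearson correlations between term counts across documents, filtered by a lower limit. The matrix, the search term and the variable names are hypothetical, and the sketch does not call tm or reproduce the correlation matrix from this page.

```r
# Hypothetical document-by-term counts (NOT the matrix from this page).
doc_terms <- matrix(c(2, 1, 0,
                      1, 1, 0,
                      0, 0, 2,
                      1, 0, 1),
                    nrow = 4, byrow = TRUE,
                    dimnames = list(paste0("doc", 1:4),
                                    c("cat", "mouse", "garden")))

# Pearson correlations between the columns (terms), across documents.
term_cor <- cor(doc_terms)

# Correlations of every other term with the search term.
search_term <- "cat"
with_term <- term_cor[search_term, colnames(term_cor) != search_term]

# Keep correlations at or above the lower limit; everything else,
# including every negative correlation, is dropped.
cor_limit <- 0.10
kept    <- with_term[with_term >= cor_limit]
dropped <- with_term[with_term <  cor_limit]

print(round(kept, 2))
print(round(dropped, 2))
```

Here "garden" tends to appear in documents where "cat" does not, so its correlation with "cat" is negative and it falls below the cutoff, while "mouse" is kept.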
Copyright © 2002-2020 Eryk Wdowiak