What's a good measure for classifying text documents?

Tags: text, measure, words, nlp, similarity

I have written an application that measures text importance. It takes a text article, splits it into words, drops stopwords, performs stemming, and counts word-frequency and document-frequency. Word-frequency counts how many times a given word appeared across all documents, and document-frequency counts how many documents a given word appeared in.

Here's an example with two text articles:

Article I) "A fox jumps over another fox."
Article II) "A hunter saw a fox."

Article I gets split into words (after stemming and dropping stopwords):

["fox", "jump", "another", "fox"].

Article II gets split into words:

["hunter", "see", "fox"].

These two articles produce the following word-frequency and document-frequency counters:

fox (word-frequency: 3, document-frequency: 2)
jump (word-frequency: 1, document-frequency: 1)
another (word-frequency: 1, document-frequency: 1)
hunter (word-frequency: 1, document-frequency: 1)
see (word-frequency: 1, document-frequency: 1)

Given a new text article, how do I measure how similar this article is to previous articles?

I've read about the tf-idf measure, but it doesn't seem to apply here as I'm dropping stopwords, so words like "a" and "the" don't appear in the counters.

For example, if I have a new text article that says "hunters love foxes", how do I come up with a measure that says this article is pretty similar to ones previously seen?

As another example, if I have a new text article that says "deer are funny", then this one is a totally new article and the similarity should be 0.

I imagine I somehow need to sum word-frequency and document-frequency counter values but what's a good formula to use?

asked May 10 '18 02:05

bodacydo

3 Answers

A standard solution is to apply the Naive Bayes classifier which estimates the posterior probability of a class C given a document D, denoted as P(C=k|D) (for a binary classification problem, k=0 and 1).

This is estimated by computing the priors from a training set of class labeled documents, where given a document D we know its class C.

P(C|D) = P(D|C) * P(C) / P(D)       (1)

The denominator P(D) is the same for every class, so it can be ignored when comparing classes. Naive Bayes assumes that terms are independent given the class, in which case you can write P(D|C) as

P(D|C) = \prod_{t \in D} P(t|C)     (2)

P(t|C) can be estimated simply by counting how many times a term occurs in a given class; e.g. you expect that the word football will occur a large number of times in documents belonging to the class (category) sports.

When it comes to the other factor, the prior P(C), you can estimate it by counting how many labeled documents are given from each class. Maybe you have more sports articles than finance ones, which makes you believe that an unseen document is more likely to belong to the sports category.

It is fairly easy to incorporate other factors, such as term importance (idf) or term dependence, into Equation (1). For idf, you add it as a term sampling event from the collection (irrespective of the class). For term dependence, you have to plug in probabilities of the form P(u|C)*P(u|t), which means that you sample a different term u and change (transform) it to t.

Standard implementations of the Naive Bayes classifier can be found in the Stanford NLP package, Weka and scikit-learn, among many others.
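As a rough illustration of Equations (1) and (2), here is a minimal pure-Python sketch of a multinomial Naive Bayes classifier. The function names (`train_naive_bayes`, `classify`), the toy training data, and the add-one (Laplace) smoothing are assumptions made for this example, not part of any particular library. Scores are computed in log space so that the product in Equation (2) becomes a sum.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labeled_docs):
    """labeled_docs: list of (tokens, label) pairs. Returns the raw counts."""
    class_counts = Counter()            # documents per class, for the prior P(C)
    term_counts = defaultdict(Counter)  # term occurrences per class, for P(t|C)
    vocab = set()
    for tokens, label in labeled_docs:
        class_counts[label] += 1
        term_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, term_counts, vocab

def classify(tokens, model):
    """Return the class with the highest log P(C) + sum_t log P(t|C)."""
    class_counts, term_counts, vocab = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)            # log prior
        total_terms = sum(term_counts[label].values())
        for t in tokens:
            if t not in vocab:
                continue  # assumption: terms never seen in training are skipped
            # add-one smoothing keeps unseen (term, class) pairs from zeroing the product
            score += math.log((term_counts[label][t] + 1) / (total_terms + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical toy training set (already tokenized)
training = [
    (["football", "goal", "team"], "sports"),
    (["team", "win", "match"], "sports"),
    (["stock", "market", "bank"], "finance"),
    (["bank", "rate", "market"], "finance"),
]
model = train_naive_bayes(training)
print(classify(["team", "goal"], model))
print(classify(["market", "bank"], model))
```

Note that this needs class labels; with only unlabeled articles, as in the question, a similarity measure is probably a better fit.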

answered Sep 18 '22 22:09

Debasis

It seems that you are trying to answer several related questions:

How to measure similarity between documents A and B? (Metric learning)
How to measure how unusual document C is, compared to some collection of documents? (Anomaly detection)
How to split a collection of documents into groups of similar ones? (Clustering)
How to predict to which class a document belongs? (Classification)

All of these problems are normally solved in 2 steps:

Extract the features: Document --> Representation (usually a numeric vector)
Apply the model: Representation --> Result (usually a single number)

There are lots of options for both feature engineering and modeling. Here are just a few.

Feature extraction

Bag of words: Document --> number of occurrences of each individual word (that is, term frequencies). This is the basic option, but not the only one.
Bag of n-grams (on word-level or character-level): co-occurrence of several tokens is taken into account.
Bag of words + grammatical features (e.g. POS tags)
Bag of word embeddings (learned by an external model, e.g. word2vec). You can use the embeddings as a sequence or take their weighted average.
Whatever you can invent (e.g. rules based on dictionary lookup)...

Features may be preprocessed in order to decrease the relative amount of noise in them. Some options for preprocessing are:

multiplying by IDF (that is, dividing by document frequency), if you don't have a hard list of stop words or believe that words might be more or less "stoppy"
normalizing each column (e.g. word count) to have zero mean and unit variance
taking logs of word counts to reduce noise
normalizing each row to have L2 norm equal to 1

You cannot know in advance which option(s) is(are) best for your specific application - you have to do experiments.
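To make a few of these preprocessing options concrete, here is a small sketch that combines log-scaled counts, IDF weighting, and L2 row normalization on sparse dictionaries. The function name `preprocess` and the exact formulas (`log(1 + count)` and plain `log(N / df)`) are choices made for this illustration; other variants are equally common.

```python
import math
from collections import Counter

def preprocess(docs):
    """docs: list of token lists. Returns one sparse, L2-normalized vector per document."""
    n = len(docs)
    # document frequency: in how many documents each term appears
    df = Counter(t for doc in docs for t in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        # log(1 + count) dampens repeated terms; log(n / df) is the unsmoothed idf
        weights = {t: math.log1p(c) * math.log(n / df[t]) for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        if norm > 0:
            weights = {t: w / norm for t, w in weights.items()}
        vectors.append(weights)
    return vectors

# The two tokenized articles from the question
corpus = [
    ["fox", "jump", "another", "fox"],
    ["hunter", "see", "fox"],
]
vectors = preprocess(corpus)
print(vectors[0])
```

With the unsmoothed idf used here, a word that appears in every document (like "fox" in this tiny corpus) gets weight 0, which shows how IDF can act as a soft stop-word filter.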

Now you can build the ML model. Each of the 4 problems has its own good solutions.

For classification, the best studied problem, you can use multiple kinds of models, including Naive Bayes, k-nearest-neighbors, logistic regression, SVM, decision trees and neural networks. Again, you cannot know in advance which would perform best.

Most of these models can use almost any kind of features. However, KNN and kernel-based SVM require your features to have a special structure: representations of documents of one class should be close to each other in the sense of the Euclidean distance metric. This can sometimes be achieved by simple linear and/or logarithmic normalization (see above). More difficult cases require non-linear transformations, which in principle may be learned by neural networks. Learning these transformations is what people call metric learning, and in general it is a problem that is not yet solved.

The most conventional distance metric is indeed Euclidean. However, other distance metrics are possible (e.g. Manhattan distance), as are different approaches that are not based on vector representations of texts. For example, you can try to calculate the Levenshtein distance between texts, based on the number of edit operations needed to transform one text into another. Or you can calculate the "word mover's distance": the sum of distances of word pairs with the closest embeddings.
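For reference, the Levenshtein distance can be computed with a standard dynamic-programming table. The sketch below (the function name `levenshtein` is just for this example) works on any two sequences, so it can compare strings character by character or token lists word by word.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(
                prev[j] + 1,         # delete x
                curr[j - 1] + 1,     # insert y
                prev[j - 1] + cost,  # substitute x with y (or match)
            ))
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))
print(levenshtein(["a", "hunter", "saw", "a", "fox"],
                  ["a", "hunter", "saw", "a", "deer"]))
```

Since the raw distance grows with text length, it is often divided by the length of the longer text before being used as a dissimilarity score.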

For clustering, basic options are K-means and DBSCAN. Both of these models require your feature space to have this Euclidean property.

For anomaly detection you can use density estimates, which are produced by various probabilistic algorithms: classification (e.g. naive Bayes or neural networks), clustering (e.g. Gaussian mixture models), or other unsupervised methods (e.g. probabilistic PCA). For texts, you can exploit the sequential language structure, estimating the probability of each word conditional on the previous words (using n-grams or convolutional/recurrent neural nets). These are called language models, and they are usually more effective than the bag-of-words assumption of Naive Bayes, which ignores word order. Several language models (one for each class) may be combined into one classifier.
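As a toy illustration of the language-model idea, the sketch below trains a bigram model with add-one smoothing and scores a document by its average log-probability per token; unusually low scores suggest an unusual document. The names `train_bigram` and `avg_log_prob`, the `<s>` start marker, and the smoothing scheme are assumptions for this example, and a real system would need far more training text.

```python
import math
from collections import Counter

def train_bigram(docs):
    """docs: list of token lists. Returns context counts, bigram counts, vocab size."""
    contexts, bigrams = Counter(), Counter()
    vocab = set()
    for doc in docs:
        tokens = ["<s>"] + doc                 # "<s>" marks the start of a document
        vocab.update(tokens)
        contexts.update(tokens[:-1])           # how often each token is followed by another
        bigrams.update(zip(tokens, tokens[1:]))
    return contexts, bigrams, len(vocab) + 1   # +1 reserves one slot for unknown words

def avg_log_prob(doc, model):
    """Average log P(word | previous word) with add-one smoothing."""
    contexts, bigrams, v = model
    tokens = ["<s>"] + doc
    total = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        total += math.log((bigrams[(prev, cur)] + 1) / (contexts[prev] + v))
    return total / max(len(doc), 1)

# The two tokenized articles from the question
corpus = [
    ["fox", "jump", "another", "fox"],
    ["hunter", "see", "fox"],
]
lm = train_bigram(corpus)
seen_score = avg_log_prob(["hunter", "see", "fox"], lm)
novel_score = avg_log_prob(["deer", "funny"], lm)
print(seen_score, novel_score)
```

The absolute values mean little on their own; what matters is how a new document's score compares with the scores of documents known to be typical.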

Whatever problem you solve, it is strongly recommended to have a good test set with the known "ground truth": which documents are close to each other, or belong to the same class, or are (un)usual. With this set, you can evaluate different approaches to feature engineering and modelling, and choose the best one.

If you don't have the resources or willingness to do multiple experiments, I would recommend choosing one of the following approaches to evaluate similarity between texts:

word counts + idf normalization + L2 normalization (equivalent to the solution of @mcoav) + Euclidean distance
mean word2vec embedding over all words in text (the embedding dictionary may be googled up and downloaded) + Euclidean distance

Based on one of these representations, you can build models for the other problems - e.g. KNN for classifications or k-means for clustering.
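The reason Euclidean distance on L2-normalized vectors is interchangeable with cosine similarity can be seen from a short identity. Assuming two document vectors $u$ and $v$ with $\|u\| = \|v\| = 1$ (the symbols are introduced here only for illustration):

```latex
\|u - v\|^2
  = \|u\|^2 + \|v\|^2 - 2\, u \cdot v
  = 2 - 2\cos(u, v)
```

So the squared Euclidean distance is a decreasing function of the cosine similarity, and ranking documents by one gives the same order as ranking by the other.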

answered Sep 18 '22 22:09

David Dale

I would suggest tf-idf and cosine similarity.

You can still use tf-idf if you drop stop-words. It is even likely that including stop-words or not would make little difference: the Inverse Document Frequency measure automatically downweights stop-words, since they are very frequent and appear in most documents.

If your new document is entirely made of unknown terms, the cosine similarity will be 0 with every known document.
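Here is a minimal pure-Python sketch of this approach on the two articles from the question. The helper names (`idf_table`, `tfidf_vector`, `cosine`, `max_similarity`) and the smoothed idf formula `log((1 + N) / (1 + df)) + 1` are choices made for this example; the new articles are assumed to have already gone through the same stemming and stop-word removal.

```python
import math
from collections import Counter

# The two tokenized articles from the question
corpus = [
    ["fox", "jump", "another", "fox"],
    ["hunter", "see", "fox"],
]

def idf_table(docs):
    """Smoothed inverse document frequency for every term in the corpus."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}

def tfidf_vector(tokens, idf):
    """Sparse tf-idf vector; terms not present in the corpus are ignored."""
    tf = Counter(tokens)
    return {t: c * idf[t] for t, c in tf.items() if t in idf}

def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def max_similarity(tokens, docs):
    """Similarity of a new article to its closest previously seen article."""
    idf = idf_table(docs)
    query = tfidf_vector(tokens, idf)
    return max(cosine(query, tfidf_vector(doc, idf)) for doc in docs)

# "hunters love foxes" and "deer are funny" after stemming and stop-word removal
print(max_similarity(["hunter", "love", "fox"], corpus))
print(max_similarity(["deer", "funny"], corpus))
```

The first article shares "hunter" and "fox" with Article II and gets a clearly positive score, while the second shares no terms with the corpus and scores exactly 0. Because unknown terms are dropped from the query vector, a word like "love" does not lower the score here; whether that is desirable depends on the application.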

answered Sep 19 '22 22:09

mcoav
