 

tf-idf with documents of different lengths

I have searched the web for ways to normalize tf scores when document lengths vary widely (for example, from 500 words to 2500 words).

The only normalization I've found is dividing the term frequency by the length of the document, which removes the document's length from consideration entirely.

This method, though, is a really bad way of normalizing tf. If anything, it gives the tf scores of each document a very large bias (unless all documents are built from pretty much the same dictionary, which is not the case when using tf-idf).

For example, take two documents: one consisting of 100 unique words and the other of 1000 unique words. Each word in doc1 will have a tf of 0.01, while in doc2 each word will have a tf of 0.001.

This causes tf-idf scores to automatically be bigger when matching words against doc1 than against doc2.
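To make that bias concrete, here is a tiny illustration (purely hypothetical numbers, each word assumed to occur exactly once per document):

# illustrative only: two documents built from unique words, each occurring once
doc1_unique_words = 100
doc2_unique_words = 1000

tf_doc1 = 1 / doc1_unique_words   # 0.01
tf_doc2 = 1 / doc2_unique_words   # 0.001

print(tf_doc1 / tf_doc2)  # 10.0: matches against doc1 score 10x higher before idf is applied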

Does anyone have a suggestion for a more suitable normalization formula?

Thank you.

Edit: I also saw a method that divides each term frequency by the maximum term frequency of its own document, but this doesn't solve my problem either.

What I was thinking is to compute the maximum term frequency across all the documents and then normalize every term by dividing its term frequency by that global maximum.
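Something like this, for example (illustrative counts only, not any standard library's API):

# term -> raw count maps for two made-up documents
docs = [
    {"apple": 4, "banana": 2},
    {"apple": 40, "banana": 20},
]

# maximum raw term frequency over all documents
global_max = max(count for doc in docs for count in doc.values())

# divide every term frequency by that global maximum
normalized = [{term: count / global_max for term, count in doc.items()} for doc in docs]
print(normalized)  # [{'apple': 0.1, 'banana': 0.05}, {'apple': 1.0, 'banana': 0.5}]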

I would love to know what you think.

Shahaf Stein asked Sep 26 '16 13:09


People also ask

What are two limitations of the TF-IDF representation?

However, TF-IDF has several limitations:

- It computes document similarity directly in the word-count space, which may be slow for large vocabularies.
- It assumes that the counts of different words provide independent evidence of similarity.
- It makes no use of semantic similarities between words.

Can TF-IDF be more than 1?

You may notice that the product of TF and IDF can be above 1. Now, the last step is to normalize these values so that TF-IDF values always scale between 0 and 1.

What is the TF-IDF value in a document?

TF-IDF stands for term frequency-inverse document frequency and it is a measure, used in the fields of information retrieval (IR) and machine learning, that can quantify the importance or relevance of string representations (words, phrases, lemmas, etc) in a document amongst a collection of documents (also known as a ...

Why is TF-IDF a good way to score how relevant documents are?

TF-IDF gives us a way to associate each word in a document with a number that represents how relevant that word is in that document. Then, documents with similar, relevant words will have similar vectors, which is what we are looking for in a machine learning algorithm.


1 Answer

What is the goal of your analysis?

If your end goal is to compare similarity between documents (and similar tasks), you should not worry about document length at the tfidf calculation stage. Here is why.

The tfidf represents your documents in a common vector space. If you then calculate the cosine similarity between these vectors, the cosine similarity compensates for the effect of different document lengths. The reason is that the cosine similarity evaluates the orientation of the vectors and not their magnitude. I can show you the point with Python. Consider the following (dumb) documents:

document1 = "apple apple banana"
document2 = "apple apple apple apple banana banana"

documents = (
    document1,
    document2)

The length of these documents is different but their content is identical. More precisely, the relative distributions of terms in the two documents are identical, but the absolute term frequencies are not.

Now, we use tfidf to represent these documents in a common vector space:

from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(documents)
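If you want to peek at the resulting vectors (an optional check; get_feature_names_out assumes a recent scikit-learn, older versions call it get_feature_names), you can do:

print(tfidf_vectorizer.get_feature_names_out())  # ['apple' 'banana']
print(tfidf_matrix.toarray())                    # one L2-normalized row per document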

And we use the cosine similarity to evaluate the similarity of these vectorized documents by looking just at their directions (or orientations) without caring about their magnitudes (that is, their lengths). I am evaluating the cosine similarity between document one and document two:

from sklearn.metrics.pairwise import cosine_similarity
cosine_similarity(tfidf_matrix[0], tfidf_matrix[1])

The result is 1. Remember that the cosine similarity between two vectors equals 1 when the two vectors have exactly the same orientation, 0 when they are orthogonal and -1 when the vectors have the opposite orientation.
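You can verify this by computing the cosine from its definition, the dot product divided by the product of the vector norms (a minimal sketch using numpy):

import numpy as np

v1 = tfidf_matrix[0].toarray().ravel()
v2 = tfidf_matrix[1].toarray().ravel()

# cosine(v1, v2) = (v1 . v2) / (||v1|| * ||v2||): it depends only on direction, not magnitude
cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
print(cos)  # ~1.0 for these two documents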

In this case, you can see that cosine similarity is not affected by the length of the documents and is capturing the fact that the relative distribution of terms in your original documents is identical! If you want to express this information as a "distance" between documents, then you can simply do:

1 - cosine_similarity(tfidf_matrix[0], tfidf_matrix[1])

This value will tend to 0 when the documents are similar (regardless of their length) and to 1 when they are dissimilar.
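To see the contrast, you could add a third document with a completely different vocabulary (a hypothetical example, reusing the vectorizer from above):

# hypothetical third document sharing no terms with the first two
document3 = "orange orange kiwi"

tfidf_matrix = tfidf_vectorizer.fit_transform((document1, document2, document3))

print(cosine_similarity(tfidf_matrix[0], tfidf_matrix[1]))      # ~1.0: same relative distribution of terms
print(cosine_similarity(tfidf_matrix[0], tfidf_matrix[2]))      # 0.0: no shared terms
print(1 - cosine_similarity(tfidf_matrix[0], tfidf_matrix[2]))  # corresponding distance of ~1.0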

helium answered Sep 26 '22 01:09