 

tf-idf with documents of different lengths

I have searched the web for ways to normalize tf scores when document lengths vary widely (for example, from 500 words to 2500 words).

The only normalization I've found is dividing the term frequency by the length of the document, which removes the document's length from consideration entirely.

This method, though, is a really bad way of normalizing tf. If anything, it gives the tf scores of each document a very large bias (unless all documents are built from pretty much the same dictionary, which is not the case when using tf-idf).

For example, take two documents: one consisting of 100 unique words and the other of 1000 unique words. Each word in doc1 will have a tf of 0.01, while in doc2 each word will have a tf of 0.001.

This causes tf-idf scores to automatically be bigger when matching words against doc1 than against doc2.
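To make that bias concrete, here is a tiny illustration (purely hypothetical numbers, each word assumed to occur exactly once per document):

# illustrative only: two documents built from unique words, each occurring once
doc1_unique_words = 100
doc2_unique_words = 1000

tf_doc1 = 1 / doc1_unique_words   # 0.01
tf_doc2 = 1 / doc2_unique_words   # 0.001

print(tf_doc1 / tf_doc2)  # 10.0: matches against doc1 score 10x higher before idf is applied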

Does anyone have a suggestion for a more suitable normalization formula?

Thank you.

Edit: I also saw a method that divides each term frequency by the maximum term frequency of its own document, but this doesn't solve my problem either.

What I was thinking is to compute the maximum term frequency across all the documents and then normalize every term by dividing its term frequency by that global maximum.
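Something like this, for example (illustrative counts only, not any standard library's API):

# term -> raw count maps for two made-up documents
docs = [
    {"apple": 4, "banana": 2},
    {"apple": 40, "banana": 20},
]

# maximum raw term frequency over all documents
global_max = max(count for doc in docs for count in doc.values())

# divide every term frequency by that global maximum
normalized = [{term: count / global_max for term, count in doc.items()} for doc in docs]
print(normalized)  # [{'apple': 0.1, 'banana': 0.05}, {'apple': 1.0, 'banana': 0.5}]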

I would love to know what you think.

Shahaf Stein asked Sep 26 '16 13:09


People also ask

What are two limitations of the TF-IDF representation?

However, TF-IDF has several limitations:

- It computes document similarity directly in the word-count space, which may be slow for large vocabularies.
- It assumes that the counts of different words provide independent evidence of similarity.
- It makes no use of semantic similarities between words.

Can TF-IDF be more than 1?

You may notice that the product of TF and IDF can be above 1. Now, the last step is to normalize these values so that TF-IDF values always scale between 0 and 1.

What is the TF-IDF value in a document?

TF-IDF stands for term frequency-inverse document frequency and it is a measure, used in the fields of information retrieval (IR) and machine learning, that can quantify the importance or relevance of string representations (words, phrases, lemmas, etc) in a document amongst a collection of documents (also known as a ...

Why is TF-IDF a good way to score how relevant documents are?

TF-IDF gives us a way to associate each word in a document with a number that represents how relevant that word is in that document. Then, documents with similar, relevant words will have similar vectors, which is what we are looking for in a machine learning algorithm.


1 Answer

What is the goal of your analysis?

If your end goal is to compare similarity between documents (and similar tasks), you should not worry about document length at the tfidf calculation stage. Here is why.

The tfidf represents your documents in a common vector space. If you then calculate the cosine similarity between these vectors, the cosine similarity compensates for the effect of different document lengths. The reason is that the cosine similarity evaluates the orientation of the vectors and not their magnitude. I can show you the point with Python. Consider the following (dumb) documents:

document1 = "apple apple banana"
document2 = "apple apple apple apple banana banana"

documents = (
    document1,
    document2)

The length of these documents is different but their content is identical. More precisely, the relative distributions of terms in the two documents are identical, but the absolute term frequencies are not.

Now, we use tfidf to represent these documents in a common vector space:

from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(documents)
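If you want to peek at the resulting vectors (an optional check; get_feature_names_out assumes a recent scikit-learn, older versions call it get_feature_names), you can do:

print(tfidf_vectorizer.get_feature_names_out())  # ['apple' 'banana']
print(tfidf_matrix.toarray())                    # one L2-normalized row per document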

And we use the cosine similarity to evaluate the similarity of these vectorized documents by looking just at their directions (or orientations) without caring about their magnitudes (that is, their lengths). I am evaluating the cosine similarity between document one and document two:

from sklearn.metrics.pairwise import cosine_similarity
cosine_similarity(tfidf_matrix[0], tfidf_matrix[1])

The result is 1. Remember that the cosine similarity between two vectors equals 1 when the two vectors have exactly the same orientation, 0 when they are orthogonal and -1 when the vectors have the opposite orientation.
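You can verify this by computing the cosine from its definition, the dot product divided by the product of the vector norms (a minimal sketch using numpy):

import numpy as np

v1 = tfidf_matrix[0].toarray().ravel()
v2 = tfidf_matrix[1].toarray().ravel()

# cosine(v1, v2) = (v1 . v2) / (||v1|| * ||v2||): it depends only on direction, not magnitude
cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
print(cos)  # ~1.0 for these two documents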

In this case, you can see that cosine similarity is not affected by the length of the documents and is capturing the fact that the relative distribution of terms in your original documents is identical! If you want to express this information as a "distance" between documents, then you can simply do:

1 - cosine_similarity(tfidf_matrix[0], tfidf_matrix[1])

This value will tend to 0 when the documents are similar (regardless of their length) and to 1 when they are dissimilar.
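To see the contrast, you could add a third document with a completely different vocabulary (a hypothetical example, reusing the vectorizer from above):

# hypothetical third document sharing no terms with the first two
document3 = "orange orange kiwi"

tfidf_matrix = tfidf_vectorizer.fit_transform((document1, document2, document3))

print(cosine_similarity(tfidf_matrix[0], tfidf_matrix[1]))      # ~1.0: same relative distribution of terms
print(cosine_similarity(tfidf_matrix[0], tfidf_matrix[2]))      # 0.0: no shared terms
print(1 - cosine_similarity(tfidf_matrix[0], tfidf_matrix[2]))  # corresponding distance of ~1.0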

helium answered Sep 26 '22 01:09