Cosine Similarity of Vectors of different lengths?

I'm trying to use TF-IDF to sort documents into categories. I've calculated the tf_idf values for some documents, but when I try to calculate the cosine similarity between two of these documents I get a traceback:

#len(u)==201, len(v)==246

cosine_distance(u, v)
ValueError: objects are not aligned

#this works though:
cosine_distance(u[:200], v[:200])
>> 0.52230249969265641

Is slicing the vector so that len(u)==len(v) the right approach? I would think that cosine similarity would work with vectors of different lengths.

I'm using this function:

import math
import numpy

def cosine_distance(u, v):
    """
    Returns the cosine of the angle between vectors v and u. This is equal to
    u.v / |u||v|.
    """
    return numpy.dot(u, v) / (math.sqrt(numpy.dot(u, u)) * math.sqrt(numpy.dot(v, v)))

Also -- is the order of the tf_idf values in the vectors important? Should they be sorted -- or is it of no importance for this calculation?

asked Jun 25 '10 20:06

erikcw

1 Answer

You need to multiply the entries for corresponding words in the two vectors, so there has to be a global order for the words. This means that, in theory, your vectors should be the same length.

In practice, if one document was seen before the other, words from the second document may have been added to the global order after the first document's vector was built. Both vectors then follow the same order, but the first one is shorter, since it has no entries for the words that were added later.
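The effect can be reproduced with a small sketch. The names `vectorize` and `vocabulary` are made up for illustration, raw word counts stand in for tf-idf weights, and tokens are treated as case-sensitive so that "The" and "the" get separate slots, as in the example that follows.

```python
def vectorize(words, vocabulary):
    """Count words into a vector indexed by a shared, growing vocabulary.

    `vocabulary` maps word -> position in the global order and is
    updated in place (hypothetical helper, not from the question).
    """
    for word in words:
        # Assign the next free index the first time a word is seen.
        vocabulary.setdefault(word, len(vocabulary))
    vector = [0] * len(vocabulary)
    for word in words:
        vector[vocabulary[word]] += 1
    return vector

vocabulary = {}
doc1 = vectorize("The quick brown fox jumped over the lazy dog".split(), vocabulary)
doc2 = vectorize("The runner was quick".split(), vocabulary)

print(len(doc1), doc1)  # 9 [1, 1, 1, 1, 1, 1, 1, 1, 1]
print(len(doc2), doc2)  # 11 [1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1]
```

Index i means the same word in both vectors, which is what matters for the dot product. The values do not need to be sorted; they only need to follow the same word order.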

Document 1: The quick brown fox jumped over the lazy dog.

Global order:     The quick brown fox jumped over the lazy dog
Vector for Doc 1:  1    1     1    1     1     1    1   1   1

Document 2: The runner was quick.

Global order:     The quick brown fox jumped over the lazy dog runner was
Vector for Doc 1:  1    1     1    1     1     1    1   1   1
Vector for Doc 2:  1    1     0    0     0     0    0   0   0    1     1

In this case, in theory you need to pad the Document 1 vector with zeros at the end. In practice, when computing the dot product, you only need to multiply elements up to the end of the Document 1 vector: omitting the extra elements of the Document 2 vector gives exactly the same result as multiplying them by zero, and visiting the extra elements is just slower.
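As a rough sketch with NumPy, here are both options applied to the example vectors above (the variable names are illustrative only):

```python
import numpy as np

# Doc 1 was built before "runner" and "was" entered the global order,
# so it is two entries shorter than Doc 2.
doc1 = np.array([1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=float)
doc2 = np.array([1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1], dtype=float)

# Option 1: pad the shorter vector with trailing zeros.
padded_doc1 = np.pad(doc1, (0, len(doc2) - len(doc1)), mode="constant")
dot_padded = np.dot(padded_doc1, doc2)

# Option 2: multiply only up to the end of the shorter vector.
n = min(len(doc1), len(doc2))
dot_truncated = np.dot(doc1[:n], doc2[:n])

print(dot_padded, dot_truncated)  # 2.0 2.0
```

Both give the same dot product, because the padded positions contribute nothing to the sum.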

Then you can compute the magnitude of each vector separately, and for that the vectors don't need to be of the same length.
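Putting the pieces together, a minimal sketch might look like the following. It assumes both vectors were built against the same global word order and that the shorter one is only missing trailing entries. The function name `cosine_similarity_unequal` and the choice to return 0.0 for an all-zero vector are assumptions made for this example.

```python
import numpy as np

def cosine_similarity_unequal(u, v):
    """Cosine similarity for vectors that share a global word order.

    Assumes index i refers to the same word in both vectors and that
    the shorter vector is only missing trailing entries (implicit zeros).
    """
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    n = min(len(u), len(v))
    dot = np.dot(u[:n], v[:n])      # overlap only; the rest would multiply by zero
    norm_u = np.sqrt(np.dot(u, u))  # each magnitude uses the full vector
    norm_v = np.sqrt(np.dot(v, v))
    if norm_u == 0 or norm_v == 0:
        return 0.0                  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)

doc1 = [1, 1, 1, 1, 1, 1, 1, 1, 1]
doc2 = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1]
similarity = cosine_similarity_unequal(doc1, doc2)
print(similarity)  # about 1/3, i.e. 2 / (3 * 2)
```

This is not the same as slicing both vectors to a common length, as in `cosine_distance(u[:200], v[:200])` from the question. Slicing also drops the extra entries from the magnitude of the longer vector, which changes the result whenever those entries are nonzero.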

answered Sep 23 '22 19:09

Ken Bloom

Tags: python, nlp, similarity, nltk, tf-idf
