Does solr use cosine similarity?

I have written a small search engine as my weekly project. It is based on the cosine similarity between the query vector and the document vector. The vectors are calculated using the tf-idf scores of the tokens.
I have come to know about Apache Solr, which is a full-text search engine. My question is: does Solr use cosine similarity internally when it ranks search results?

asked Jul 09 '14 18:07

Haider Ali

3 Answers

No. Solr uses something similar to cosine similarity, but not quite the same - there are some key differences.

If you visit the same TFIDFSimilarity documentation page cited in the other answer (https://lucene.apache.org/core/4_10_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html) and scroll further down, you will see the "Lucene Conceptual Scoring Formula" and "Lucene Practical Scoring Formula" sections, which give more details.

Ignoring any index/query-time boosts, the following are some key differences:

1. Different document normalization factor

Instead of normalizing each document by the Euclidean norm of its tf-idf vector, Lucene uses a "doc-len-norm". For the default similarity measure (DefaultSimilarity) this is just 1/sqrt(number of terms in the doc), which basically equals 1/sqrt(sum(tf)), where sum(tf) is the sum of the term counts in the doc. Unlike the Euclidean norm, the counts are not squared and the idf of each term is left out. Furthermore, this value is rounded to a byte to save space. As a result, it will most often come out to a different value than the normalization factor used for cosine similarity.
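
To make the difference concrete, here is a minimal Python sketch that compares the two normalization factors on a toy document. The term counts and idf values are made up for illustration, raw term counts are used as tf, and the byte rounding is left out, so treat it as a simplification rather than Lucene's actual code.

```python
import math

# Hypothetical term counts and idf values for one toy document.
tf = {"solr": 3, "search": 1, "engine": 1}
idf = {"solr": 2.0, "search": 1.2, "engine": 1.5}

def euclidean_norm_factor(tf, idf):
    """1 / Euclidean norm of the tf-idf vector (the cosine similarity normalizer)."""
    return 1.0 / math.sqrt(sum((tf[t] * idf[t]) ** 2 for t in tf))

def doc_len_norm(tf):
    """1 / sqrt(number of terms in the doc): no squaring, no idf."""
    return 1.0 / math.sqrt(sum(tf.values()))

cosine_factor = euclidean_norm_factor(tf, idf)
lucene_factor = doc_len_norm(tf)
print(cosine_factor, lucene_factor)
```

The two factors differ for the same document, which is why the resulting scores are not true cosine similarities.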

2. Extra "coord" boost

There is also an extra value multiplied onto the score equal to: the number of query terms matched in the document / the total number of terms in the query.

This gives an extra boost to fields (documents) that match more of the query terms, and may be of questionable value. It essentially multiplies the tf-idf vector score by another inner product: the inner product of these vectors converted to boolean vectors (0 if the vector does not have the given term, 1 if it does), with only the query vector normalized by its Euclidean norm.
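
As a rough illustration of the ratio described above, the following Python sketch computes a coord-style factor for two toy documents. The function name `coord` and the example terms are my own, and counting distinct terms is an assumption made for simplicity.

```python
def coord(query_terms, doc_terms):
    """Fraction of distinct query terms that also occur in the document."""
    query = set(query_terms)
    matched = query & set(doc_terms)
    return len(matched) / len(query)

# Hypothetical query and documents, already tokenized.
query = ["apache", "solr", "ranking"]
doc_a = ["apache", "solr", "is", "a", "search", "server"]
doc_b = ["solr", "tutorial"]

print(coord(query, doc_a))  # matches 2 of 3 query terms
print(coord(query, doc_b))  # matches 1 of 3 query terms
```

A document matching two of the three query terms gets twice the multiplier of one matching a single term, independent of the tf-idf weights.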

answered Nov 23 '22 06:11

Brian

Yes, Solr (which runs on top of Lucene) does use Cosine similarity. From the Lucene documentation:

VSM score of document d for query q is the Cosine Similarity of the weighted query vectors V(q) and V(d)

cosine-similarity(q,d) = V(q) · V(d) / |V(q)| |V(d)|

https://lucene.apache.org/core/4_0_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html
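
For reference, the quoted formula can be sketched in plain Python over sparse weight vectors. This is only an illustration of the textbook formula, not Lucene code; the dictionaries stand in for V(q) and V(d), and the weights are invented.

```python
import math

def cosine_similarity(q, d):
    """V(q) . V(d) / (|V(q)| |V(d)|) for sparse vectors stored as dicts."""
    dot = sum(w * d.get(term, 0.0) for term, w in q.items())
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    if norm_q == 0.0 or norm_d == 0.0:
        return 0.0  # convention chosen here for empty vectors
    return dot / (norm_q * norm_d)

# Hypothetical weighted vectors (e.g. tf-idf weights per term).
q = {"solr": 1.0, "ranking": 1.0}
d = {"solr": 3.0, "lucene": 4.0}

score = cosine_similarity(q, d)
print(score)
```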

answered Nov 23 '22 07:11

John Petrone

If you're looking for actual vector similarity in Solr, there are two approaches:

1) Use delimited payloads. There are a few plugins that implement this already, like https://github.com/moshebla/solr-vector-scoring and https://github.com/saaay71/solr-vector-scoring

2) Use streaming expressions, which come out of the box: https://lucene.apache.org/solr/guide/8_5/vector-math.html

The latter is slower, but more flexible.

answered Nov 23 '22 08:11

Radu Gheorghe


Tags:

solr

lucene

search-engine