
Does solr use cosine similarity?

I have written a small search engine as my weekly project. It is based on the cosine similarity between the query vector and the document vector. Each vector is calculated using the tf-idf scores of its tokens.
I have come to know about Apache Solr, which is a full-text search engine. My question is: does Solr use cosine similarity internally when ranking search results?

Haider Ali asked Jul 09 '14 18:07




3 Answers

No. Solr uses something similar to cosine similarity, but not quite the same - there are some key differences.

If you visit that same link (https://lucene.apache.org/core/4_10_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html) and scroll further down, you will see "Lucene Conceptual Scoring Formula" and "Lucene Practical Scoring Formula" that give more details.

Ignoring any index/query-time boosts, the following are some key differences:

1. Different document normalization factor

Instead of normalizing each document by the Euclidean norm of its tf-idf vector, Lucene uses "doc-len-norm". For the default similarity measure (DefaultSimilarity) this is just 1/sqrt(number of terms in the doc), which basically equals 1/sqrt(sum(tf)), where sum(tf) is the sum of the term counts in the doc. Unlike the Euclidean norm, the counts are not squared and the idf of each term is left out. Furthermore, this value is rounded to fit in a single byte to save space. The result will most often differ from the normalization factor used for cosine similarity.

2. Extra "coord" boost

There is also an extra value multiplied onto the score equal to: the number of query terms matched in the document / the total number of terms in the query.

This gives an extra boost to fields (documents) that match more of the query terms, and may be of questionable value. It essentially multiplies the tf-idf vector score by another inner product: the inner product of the same vectors converted to boolean vectors (0 if the vector does not have the given term, 1 if it does), with only the query vector normalized by its Euclidean norm.
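The two differences above can be illustrated with a small Python sketch. This is not Lucene code: the function names, the sample document, and the idf values are made up for illustration, the single-byte rounding of the norm is not modeled, and coord is computed over distinct query terms as a simplification.

```python
import math
from collections import Counter

def euclidean_norm_factor(tf, idf):
    # 1 / ||tf-idf vector||: the document normalizer of true cosine similarity.
    return 1.0 / math.sqrt(sum((tf[t] * idf[t]) ** 2 for t in tf))

def doc_len_norm(tf):
    # 1 / sqrt(number of terms in the doc), as described for DefaultSimilarity.
    # The rounding to a single byte is deliberately left out of this sketch.
    return 1.0 / math.sqrt(sum(tf.values()))

def coord(query_terms, tf):
    # Fraction of distinct query terms that occur in the document.
    unique = set(query_terms)
    return sum(1 for t in unique if t in tf) / len(unique)

# Hypothetical document and idf values, chosen only to keep the numbers simple.
doc = "solr ranks documents with lucene and lucene scores documents".split()
tf = Counter(doc)
idf = {t: 1.0 for t in tf}
idf["solr"] = 2.0
query = ["solr", "cosine"]

cosine_factor = euclidean_norm_factor(tf, idf)
length_factor = doc_len_norm(tf)
coord_factor = coord(query, tf)

print(round(cosine_factor, 3))  # 0.25
print(round(length_factor, 3))  # 0.333
print(coord_factor)             # 0.5
```

For this nine-token document the two normalizers already disagree (1/4 versus 1/3), and the coord factor halves the score because only one of the two query terms matches.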

Brian answered Nov 23 '22 06:11


Yes, Solr (which runs on top of Lucene) does use Cosine similarity. From the Lucene documentation:

VSM score of document d for query q is the Cosine Similarity of the weighted query vectors V(q) and V(d)

cosine-similarity(q,d) = V(q) · V(d) / |V(q)| |V(d)|

https://lucene.apache.org/core/4_0_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html
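As a rough illustration of the quoted formula only (not of Lucene's practical scoring), a minimal Python sketch might look like the following. The sparse term-to-weight dictionaries and the sample weights are assumptions made for this example.

```python
import math

def cosine_similarity(q, d):
    # V(q) · V(d) / (|V(q)| |V(d)|) for sparse vectors stored as term -> weight dicts.
    dot = sum(w * d.get(t, 0.0) for t, w in q.items())
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    if norm_q == 0.0 or norm_d == 0.0:
        return 0.0  # avoid dividing by zero for empty vectors
    return dot / (norm_q * norm_d)

# Hypothetical tf-idf weights for a two-term query and a two-term document.
query_vec = {"solr": 1.0, "cosine": 1.0}
doc_vec = {"solr": 3.0, "lucene": 4.0}
score = cosine_similarity(query_vec, doc_vec)
print(round(score, 4))  # 0.4243
```

Only the shared term "solr" contributes to the dot product, so the score is 3 / (sqrt(2) * 5).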

John Petrone answered Nov 23 '22 07:11


If you're looking for actual vector similarity in Solr, there are two approaches:

1) Use delimited payloads. There are a few plugins that implement this already, like https://github.com/moshebla/solr-vector-scoring and https://github.com/saaay71/solr-vector-scoring

2) Use streaming expressions, which come out of the box: https://lucene.apache.org/solr/guide/8_5/vector-math.html

The latter is slower, but more flexible.

Radu Gheorghe answered Nov 23 '22 08:11