Is it possible to model cosine similarity in Solr/Lucene?

Question

I'm interested in possible ways to model the cosine similarity algorithm using Solr. I have items which are assigned a vector, for example:

items = [
  { id: 1, vector: [0,0,0,2,3,0,0] },
  { id: 2, vector: [0,1,0,1,5,0,0] },
  { id: 3, vector: [2,3,0,0,0,1,0] },
  { id: 4, vector: [1,2,4,6,5,0,0] }
]

And a search vector to which the others need to be ranked.

Currently, I'm modeling this in ruby by running over all the items and assigning them a rank against the input vector. Here's the implementation of cosine similarity I'm using:

module SimilarityCalculator

  def self.get_similarity(vector1, vector2)
    dp = dot_product(vector1, vector2)
    nm = normalize(vector1) * normalize(vector2)
    dp / nm
  end

  private

  def self.dot_product(vector1, vector2)
    sum = 0.0
    vector1.each_with_index { |val, i| sum += val * vector2[i] }
    sum
  end

  def self.normalize(vector)
    Math.sqrt(vector.inject(0.0) { |m,o| m += o**2 })
  end

end

Then, to get a ranked list I would do something like the following:

ranked = []
search_vector = [1,0,0,3,5,0,0]
items.each do |item|
  rank = SimilarityCalculator.get_similarity(search_vector, item.vector)
  { id: item.id, rank: rank }
end

I don't know enough about Solr to know how this would be modeled or even if it can, but I thought I'd throw it out there.

Xodarap · Accepted Answer

Lucene already uses a cosine similarity model, so the real question is: can you map your vectors into Lucene's vectors? And can you remove the norming etc. which Lucene does that you don't want?

You can always write your own scoring and analysis functions, so if you're willing to do some coding, the answer is a clear "yes". This might require more work than you want though.

For an option which may get you part of the way but doesn't require any coding: translate each dimension into the word "dim_n" and repeat it (or boost it) however many times is the vector's magnitude in that dimension. For example:

[1,2,0,1] ==> "dim_1 dim_2 dim_2 dim_4"

If your vectors are all of approximately the same size and evenly distributed across dimensions, this might be a very good approximation.

If you tell us more about the problem (e.g. do you really need to give Lucene vectors as input or can you give it text?) we might be able to find better solutions.

Is it possible to model cosine similarity in Solr/Lucene?

Tags:

ruby

solr

lucene

cosine-similarity

JC Grubbs

1 Answers

Xodarap

Recent Activity

Donate For Us

Is it possible to model cosine similarity in Solr/Lucene?

Tags:

ruby

solr

lucene

cosine-similarity

JC Grubbs

1 Answers

Xodarap

Related questions

Recent Activity

Donate For Us