Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Is it possible to model cosine similarity in Solr/Lucene?

I'm interested in possible ways to model the cosine similarity algorithm using Solr. I have items which are assigned a vector, for example:

items = [
  { id: 1, vector: [0,0,0,2,3,0,0] },
  { id: 2, vector: [0,1,0,1,5,0,0] },
  { id: 3, vector: [2,3,0,0,0,1,0] },
  { id: 4, vector: [1,2,4,6,5,0,0] }
]

And a search vector to which the others need to be ranked.

Currently, I'm modeling this in ruby by running over all the items and assigning them a rank against the input vector. Here's the implementation of cosine similarity I'm using:

module SimilarityCalculator

  def self.get_similarity(vector1, vector2)
    dp = dot_product(vector1, vector2)
    nm = normalize(vector1) * normalize(vector2)
    dp / nm
  end

  private

  def self.dot_product(vector1, vector2)
    sum = 0.0
    vector1.each_with_index { |val, i| sum += val * vector2[i] }
    sum
  end

  def self.normalize(vector)
    Math.sqrt(vector.inject(0.0) { |m,o| m += o**2 })
  end

end

Then, to get a ranked list I would do something like the following:

ranked = []
search_vector = [1,0,0,3,5,0,0]
items.each do |item|
  rank = SimilarityCalculator.get_similarity(search_vector, item.vector)
  { id: item.id, rank: rank }
end

I don't know enough about Solr to know how this would be modeled or even if it can, but I thought I'd throw it out there.

like image 421
JC Grubbs Avatar asked Nov 13 '22 10:11

JC Grubbs


1 Answers

Lucene already uses a cosine similarity model, so the real question is: can you map your vectors into Lucene's vectors? And can you remove the norming etc. which Lucene does that you don't want?

You can always write your own scoring and analysis functions, so if you're willing to do some coding, the answer is a clear "yes". This might require more work than you want though.

For an option which may get you part of the way but doesn't require any coding: translate each dimension into the word "dim_n" and repeat it (or boost it) however many times is the vector's magnitude in that dimension. For example:

[1,2,0,1] ==> "dim_1 dim_2 dim_2 dim_4"

If your vectors are all of approximately the same size and evenly distributed across dimensions, this might be a very good approximation.

If you tell us more about the problem (e.g. do you really need to give Lucene vectors as input or can you give it text?) we might be able to find better solutions.

like image 147
Xodarap Avatar answered Dec 19 '22 16:12

Xodarap