I'm interested in possible ways to model the cosine similarity algorithm using Solr. I have items which are assigned a vector, for example:
items = [
{ id: 1, vector: [0,0,0,2,3,0,0] },
{ id: 2, vector: [0,1,0,1,5,0,0] },
{ id: 3, vector: [2,3,0,0,0,1,0] },
{ id: 4, vector: [1,2,4,6,5,0,0] }
]
And a search vector to which the others need to be ranked.
Currently, I'm modeling this in ruby by running over all the items and assigning them a rank against the input vector. Here's the implementation of cosine similarity I'm using:
module SimilarityCalculator
def self.get_similarity(vector1, vector2)
dp = dot_product(vector1, vector2)
nm = normalize(vector1) * normalize(vector2)
dp / nm
end
private
def self.dot_product(vector1, vector2)
sum = 0.0
vector1.each_with_index { |val, i| sum += val * vector2[i] }
sum
end
def self.normalize(vector)
Math.sqrt(vector.inject(0.0) { |m,o| m += o**2 })
end
end
Then, to get a ranked list I would do something like the following:
ranked = []
search_vector = [1,0,0,3,5,0,0]
items.each do |item|
rank = SimilarityCalculator.get_similarity(search_vector, item.vector)
{ id: item.id, rank: rank }
end
I don't know enough about Solr to know how this would be modeled or even if it can, but I thought I'd throw it out there.
Lucene already uses a cosine similarity model, so the real question is: can you map your vectors into Lucene's vectors? And can you remove the norming etc. which Lucene does that you don't want?
You can always write your own scoring and analysis functions, so if you're willing to do some coding, the answer is a clear "yes". This might require more work than you want though.
For an option which may get you part of the way but doesn't require any coding: translate each dimension into the word "dim_n" and repeat it (or boost it) however many times is the vector's magnitude in that dimension. For example:
[1,2,0,1] ==> "dim_1 dim_2 dim_2 dim_4"
If your vectors are all of approximately the same size and evenly distributed across dimensions, this might be a very good approximation.
If you tell us more about the problem (e.g. do you really need to give Lucene vectors as input or can you give it text?) we might be able to find better solutions.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With