 

How to find similarity in a document field in MongoDB?

Given data that looks like this:

{'_id': 'foobar1',
 'about': 'similarity in comparison',
 'categories': ['one', 'two', 'three']}
{'_id': 'foobar2',
 'about': 'perfect similarity in comparison',
 'categories': ['one']}
{'_id': 'foobar3',
 'about': 'partial similarity',
 'categories': ['one', 'two']}
{'_id': 'foobar4',
 'about': 'none',
 'categories': ['one', 'two']}

I would like to compute the similarity between a single item and all other items in the collection, then return those items in order of highest similarity. Similarity is based on the number of words in common; there is already a function int similar(String one, String two).

For example, if I want the similarity list for the about field of foobar1, it would return:

[{'_id': 'foobar2'}, {'_id': 'foobar3'}, {'_id': 'foobar4'}]
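For reference, the ranking described above can be sketched in plain JavaScript, outside the database. This is only an illustration under assumptions: similar() here is a hypothetical stand-in for the existing int similar(String one, String two) that counts distinct shared words, and rankBySimilarity is a made-up helper name. It also assumes all documents have already been fetched to the client.

```javascript
// Hypothetical stand-in for the existing similar(String, String):
// counts the distinct whitespace-separated words the two strings share.
function similar(one, two) {
  const words = (s) => new Set(s.toLowerCase().split(/\s+/).filter(Boolean));
  const a = words(one);
  const b = words(two);
  let common = 0;
  for (const w of a) {
    if (b.has(w)) common += 1;
  }
  return common;
}

// Hypothetical helper: rank every other document against the source,
// most similar first, and keep only the _id of each result.
function rankBySimilarity(source, docs) {
  return docs
    .filter((d) => d._id !== source._id)
    .map((d) => ({ _id: d._id, score: similar(source.about, d.about) }))
    .sort((x, y) => y.score - x.score)
    .map(({ _id }) => ({ _id }));
}

// The sample documents from above (categories omitted for brevity).
const docs = [
  { _id: 'foobar1', about: 'similarity in comparison' },
  { _id: 'foobar2', about: 'perfect similarity in comparison' },
  { _id: 'foobar3', about: 'partial similarity' },
  { _id: 'foobar4', about: 'none' },
];

const ranked = rankBySimilarity(docs[0], docs);
console.log(ranked);
```

With the sample data, foobar2 shares three words with foobar1, foobar3 shares one, and foobar4 shares none, which gives the order listed above.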

I am doing this with Morphia, but given just the plain MongoDB implementation, I could figure out the rest.

asked Jul 14 '16 04:07 by sicter


1 Answer

If you need to compute text similarity on the about field, one way to achieve this is to use a text index.

For example (in the mongo shell), if you create a text index on the about field:

db.collection.createIndex({about: 'text'})

you could execute a query such as (example taken from https://docs.mongodb.com/manual/reference/operator/query/text/#sort-by-text-search-score):

db.collection.find({$text: {$search: 'similarity in comparison'}}, {score: {$meta: 'textScore'}}).sort({score: {$meta: 'textScore'}})

With your example documents, the query should return something like:

{
  "_id": "foobar1",
  "about": "similarity in comparison",
  "score": 1.5
}
{
  "_id": "foobar2",
  "about": "perfect similarity in comparison",
  "score": 1.3333333333333333
}
{
  "_id": "foobar3",
  "about": "partial similarity",
  "score": 0.75
}

The results are sorted by decreasing similarity score. Note that, unlike your example result, document foobar4 is not returned, because none of the queried words are present in it, while foobar1 itself is returned because it matches its own text.
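Because the text query also matches the source document, you may want to exclude it by _id. A minimal sketch of how the query arguments could be built from a source document follows; buildSimilarityQuery is a hypothetical helper name, and the code only constructs plain objects without contacting the server.

```javascript
// Hypothetical helper: builds the arguments for find() and sort()
// from a source document. Nothing here talks to MongoDB.
function buildSimilarityQuery(source) {
  return {
    filter: {
      // Search the text index using the source document's own text...
      $text: { $search: source.about },
      // ...but leave the source document out of the results.
      _id: { $ne: source._id },
    },
    // Expose the text score on each result and sort by it.
    projection: { score: { $meta: 'textScore' } },
    sort: { score: { $meta: 'textScore' } },
  };
}

const q = buildSimilarityQuery({ _id: 'foobar1', about: 'similarity in comparison' });
console.log(JSON.stringify(q, null, 2));

// In the mongo shell, with the text index in place, this would be used as:
//   db.collection.find(q.filter, q.projection).sort(q.sort)
```

The scores come from MongoDB's text search rather than from a plain count of common words, so the ordering may not always match a custom similar() function.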

Text indexes are considered a special type of index in MongoDB, and thus come with some specific rules on their usage. For more details, please see:

  • Text Indexes
  • $text query operator
answered Oct 23 '22 02:10 by kevinadi