
Query by relevance with different weight

I provide a search function for products.

Users can search by multiple tags.

For example, a user can search for "iphone,128G, usa".

If a search term matches in the title, it scores 3 points.

If a search term matches in the tags, it scores 1 point.

How can I rewrite my current query to produce the following result?

  • document 1 will get: 7 points
  • document 2 will get: 4 points

Sample document 1

"title": "iphone 6 128G",
"tag": [
  "usa",
  "golden",
]

Sample document 2

"title": "iphone 4 64G",
"tag": [
  "usa",
  "golden",
]
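
To make the expected numbers concrete, the intended scoring can be sketched in plain JavaScript. This is not a MongoDB query; the helper name desiredScore, the whitespace splitting of the title, and the case-insensitive comparison are assumptions made only for illustration.

```javascript
// Hypothetical helper: 3 points per term found in the title, 1 point per term found in the tags.
// Splitting the title on whitespace and ignoring case are assumptions, not part of the question.
function desiredScore(doc, terms) {
    const titleTokens = doc.title.toLowerCase().split(/\s+/);
    const tags = doc.tag.map((t) => t.toLowerCase());
    let points = 0;
    for (const raw of terms) {
        const term = raw.trim().toLowerCase();
        if (titleTokens.includes(term)) points += 3; // title match
        if (tags.includes(term)) points += 1;        // tag match
    }
    return points;
}

const terms = "iphone,128G, usa".split(",");
const doc1 = { title: "iphone 6 128G", tag: ["usa", "golden"] };
const doc2 = { title: "iphone 4 64G", tag: ["usa", "golden"] };

console.log(desiredScore(doc1, terms)); // 7 (iphone: 3, 128G: 3, usa: 1)
console.log(desiredScore(doc2, terms)); // 4 (iphone: 3, usa: 1)
```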

Current query

  collection.aggregate(
     {
      "$match" => {
          "tag":{ "$in"=> q_params },
      }
     },
     { "$unwind" => "$tag" },
     { "$match" => { "tag"=> { "$in"=> q_params } } },
     { "$group" => { "_id"=> {"title":"$title"},
        "points"=> { "$sum"=>1 } } },
     { "$sort" => { "points"=> -1 } }
  )

user3675188 asked Jul 25 '15 08:07


1 Answer

I think you are approaching this slightly the wrong way and asking too much "fuzzy matching" of a database. Instead, consider this revised data sample:

db.items.insert([
    {
        "title": "iphone 6 128G",
        "tags": [
            "iphone",
            "iphone6",
            "128G",
            "usa",
            "golden",
        ]
    },
    {
        "title": "iphone 4 64G",
        "tags": [
            "iphone",
            "iphone4",
            "64G",
            "usa",
            "golden",
        ]
    }
])

Now if you consider a "search phrase" like this:

"iphone4 128G usa"

Then you need to implement your own application logic (not really a hard thing, just a lookup against a list of master tags) that expands the phrase into something like this:

var searchedTags = ["iphone", "iphone4", "128G", "usa"]
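
As a rough sketch of that application logic, assuming a hand-maintained lookup of "master tags" (the masterTags object and the expandTags function are hypothetical names, not anything MongoDB provides), the expansion could look like this:

```javascript
// Hypothetical "master tags" lookup: each known search token maps to the tags it implies.
const masterTags = {
    iphone: ["iphone"],
    iphone4: ["iphone", "iphone4"],
    iphone6: ["iphone", "iphone6"],
};

// Split the phrase on whitespace or commas, expand known tokens via masterTags,
// pass unknown tokens through unchanged, and drop duplicates while keeping order.
function expandTags(phrase) {
    const seen = new Set();
    for (const token of phrase.split(/[\s,]+/).filter(Boolean)) {
        const known = Object.prototype.hasOwnProperty.call(masterTags, token);
        const expanded = known ? masterTags[token] : [token];
        for (const tag of expanded) seen.add(tag);
    }
    return Array.from(seen);
}

console.log(expandTags("iphone4 128G usa")); // [ 'iphone', 'iphone4', '128G', 'usa' ]
```

Unknown tokens such as "128G" and "usa" are kept as they are, so only the tokens that imply a broader tag need an entry in the lookup.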

Then you can construct a pipeline query like this:

db.items.aggregate([
    { "$match": { "tags": { "$in": searchedTags } } },
    { "$project": {
        "title": 1,
        "tags": 1,
        "score": {
            "$let": {
                "vars": {
                    "matchSize": {
                        "$size": {
                            "$setIntersection": [
                                "$tags",
                                searchedTags
                            ]
                        }
                    }
                },
                "in": {
                    "$add": [
                        "$$matchSize",
                        { "$cond": [
                            { "$eq": [
                                "$$matchSize",
                                { "$size": "$tags" }
                            ]},
                            "$$matchSize",
                            0
                        ]}
                    ]
                }
            }
        }
    }},
    { "$sort": { "score": -1 } }
])

Which returns these results:

{
    "_id" : ObjectId("55b3551164518e494632fa19"),
    "title" : "iphone 6 128G",
    "tags" : [
            "iphone",
            "iphone6",
            "128G",
            "usa",
            "golden"
    ],
    "score" : 3
}
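{
    "_id" : ObjectId("55b3551164518e494632fa1a"),
    "title" : "iphone 4 64G",
    "tags" : [
            "iphone",
            "iphone4",
            "64G",
            "usa",
            "golden"
    ],
    "score" : 3
}

Both documents match three of the searched tags here ("iphone", "128G" and "usa" for the first; "iphone", "iphone4" and "usa" for the second), so they tie. In general, the document with more matching "tags" wins.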

But if the phrase was changed to something like this:

"iphone4 64G usa golden"

Which resulted in parsed tags like this:

var searchedTags = ["iphone", "iphone4", "64G", "usa", "golden"]

Then the same query pipeline produces this:

{
    "_id" : ObjectId("55b3551164518e494632fa1a"),
    "title" : "iphone 4 64G",
    "tags" : [
            "iphone",
            "iphone4",
            "64G",
            "usa",
            "golden"
    ],
    "score" : 10
}
{
    "_id" : ObjectId("55b3551164518e494632fa19"),
    "title" : "iphone 6 128G",
    "tags" : [
            "iphone",
            "iphone6",
            "128G",
            "usa",
            "golden"
    ],
    "score" : 3
}

Here one document not only matches more of the provided tags than the other, but because all of its tags were matched by the tags provided it also gets an additional score boost, pushing it even further up the rankings than something that just matched the same number of tags.

To break that down, first consider that the $let expression declares a "variable" for use within that expression, so we do not "repeat ourselves" by typing out the same expression for the resulting $$matchSize value in multiple places.

That variable itself is determined by working out the resulting array from the $setIntersection of the searchedTags array and the document's own "tags" array. The result of the "intersection" is just those items that appear in both, which makes it possible to test the $size of that array.

So later, while attributing the $size of that match to the "score", the other consideration is given via the ternary $cond, which checks whether $$matchSize is equal to the length of the document's "tags" array, meaning every tag on the document was among the searched tags. Where that is true, $$matchSize is added to itself (a score of double the "tags" length) for being an "exact match" to the provided tags; otherwise the returned result of that condition is 0.

Processing those two numeric results with $add produces the end total "score" value for each document.
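
For checking results by hand, the same arithmetic can be mirrored in plain JavaScript. This is only an equivalent sketch of the "score" expression, not something the pipeline runs, and the function name score is made up for illustration:

```javascript
// Plain JavaScript equivalent of the pipeline's "score" expression.
function score(tags, searchedTags) {
    const searched = new Set(searchedTags);
    // $setIntersection treats both arrays as sets, so duplicates count once
    const matchSize = new Set(tags.filter((t) => searched.has(t))).size;
    // $cond: the bonus applies only when every tag on the document was matched
    const bonus = matchSize === tags.length ? matchSize : 0;
    return matchSize + bonus; // $add
}

const searched = ["iphone", "iphone4", "64G", "usa", "golden"];
console.log(score(["iphone", "iphone4", "64G", "usa", "golden"], searched)); // 10
console.log(score(["iphone", "iphone6", "128G", "usa", "golden"], searched)); // 3
```

The first document matches all five of its tags, so it scores 5 + 5; the second matches only "iphone", "usa" and "golden", so it scores 3 with no bonus.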


The main point of this is that the aggregation framework lacks the operators to do any sort of "fuzzy match" on a string such as the title. Whilst you can $regex match within a $match stage, since this is basically a query operator, it will only "filter" results.

You can "mess around" with that, but really what you want from a regex is a numeric "score" for the terms that match. Such splitting (though possible with regex operators in other languages) is not really available, so it makes more sense to simply "tokenize" your "tags" for input and match them against the document "tags".

For a "database" (which MongoDB primarily is) this is a better solution. Or perhaps you can even combine that with the $text search operator to project its own "score" value on the title, in combination with the "parsed tags" logic as demonstrated here, which gives even more weight to "exact matches".

It can be used in conjunction with the aggregation pipeline, but even on its own it gives reasonable results:

db.items.createIndex({ "title": "text" })

db.items.find({ 
    "$text": { "$search": "iphone 4 64G" } },
    { "score": { "$meta": "textScore" }}
).sort({ "score": { "$meta": "textScore" } })

Would produce:

{
    "_id" : ObjectId("55b3551164518e494632fa1a"),
    "title" : "iphone 4 64G",
    "tags" : [
            "iphone",
            "iphone4",
            "64G",
            "usa",
            "golden"
    ],
    "score" : 2
}
{
    "_id" : ObjectId("55b3551164518e494632fa19"),
    "title" : "iphone 6 128G",
    "tags" : [
            "iphone",
            "iphone6",
            "128G",
            "usa",
            "golden"
    ],
    "score" : 0.6666666666666666
}
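
The combination with the aggregation pipeline mentioned above could be sketched roughly as follows. This is an untested outline under some assumptions: the text index on "title" already exists, every document has a "tags" array, and simply adding the two values (equal weighting) is an arbitrary choice. The function name buildCombinedPipeline is hypothetical.

```javascript
// Builds a pipeline that adds the $text relevance of the title to the count of matching tags.
function buildCombinedPipeline(phrase, searchedTags) {
    return [
        // A $match using $text has to be the first stage of the pipeline
        { "$match": { "$text": { "$search": phrase } } },
        { "$project": {
            "title": 1,
            "tags": 1,
            "textScore": { "$meta": "textScore" },
            "tagScore": {
                "$size": { "$setIntersection": ["$tags", searchedTags] }
            }
        }},
        // Equal weighting of the two scores is an assumption; adjust as needed
        { "$project": {
            "title": 1,
            "tags": 1,
            "score": { "$add": ["$textScore", "$tagScore"] }
        }},
        { "$sort": { "score": -1 } }
    ];
}

const pipeline = buildCombinedPipeline("iphone 4 64G", ["iphone", "iphone4", "64G"]);
// In the mongo shell: db.items.aggregate(pipeline)
```

Note that the $text stage only lets through documents whose title matches at least one term, so documents that match on tags alone would not appear with this arrangement.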

But if you just want to send strings and do not want to be bothered with the "tokenizing" logic, and want other logic to attribute your "score", then look into dedicated text search engines, which do this a whole lot better than the "text search" or even basic search capabilities of a primary function database like MongoDB.

Blakes Seven answered Oct 23 '22 18:10