Elasticsearch "More Like This" API vs. more_like_this query

Tags: json, rest, elasticsearch

Elasticsearch has two similar features to get "similar" documents:

There is the "More Like This API". It gives me documents similar to a given one. I can't use it in more complex expressions though.

There is also the "more_like_this" query for use in the Search API. I can use it in bool or boosting expressions, but I can't give it the id of a document; I have to provide the "like_text" parameter.

I have documents with tags and content. Some documents will have good tags and some won't have any. I want a "Similar documents" feature that will work every time but will rank documents with matching tags higher than documents with matching text. My idea was:

{
    "boosting" : {
        "positive" : {
            "more_like_this" : {
                "fields" : ["tag"],
                "id" : "23452",
                "min_term_freq" : 1
            }
        },
        "negative" : {
            "more_like_this" : {
                "fields" : ["tag"],
                "id" : "23452"
            }
        },
        "negative_boost" : 0.2
    }
}

Obviously this doesn't work because there is no "id" in "more_like_this". What are the alternatives?


asked Mar 08 '13 18:03

Antoni Myłka

2 Answers

First, a short introduction to the more like this functionality and how it works. The idea is that you have a specific document and you want to find others that are similar to it.

To do this, we need to extract some content from the current document and use it to build a query that finds similar ones. We can extract content from the Lucene stored fields (or the Elasticsearch _source field, which is effectively a stored field in Lucene) and reanalyze it, or use the information stored in the term vectors (if enabled at index time) to get a list of terms to query with, without having to reanalyze the text. I'm not sure whether Elasticsearch tries the latter approach when term vectors are available, though.

The more like this query lets you provide a text, regardless of where you got it from. That text is used to query the fields you select and return similar documents. The text is not used in its entirety: it is reanalyzed, and at most max_query_terms terms (default 25) are kept, chosen from the terms that occur at least min_term_freq times (minimum term frequency, default 2) and have a document frequency between min_doc_freq and max_doc_freq. There are other parameters that influence the generated query as well.
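The term selection described above can be sketched roughly as follows. This is a simplified illustration, not Elasticsearch's actual implementation: the function name select_query_terms is hypothetical, the regex split is only a stand-in for a real analyzer, terms are ranked by raw frequency alone, and the min_doc_freq / max_doc_freq filter is left out because it needs index statistics.

```python
import re
from collections import Counter


def select_query_terms(text, min_term_freq=2, max_query_terms=25):
    """Pick the terms a more-like-this style query might keep (simplified)."""
    # Crude stand-in for an analyzer: lowercase and split on non-word characters.
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter(tokens)
    # Keep only terms that occur often enough in the input text.
    frequent = [(term, n) for term, n in counts.items() if n >= min_term_freq]
    # Most frequent first; ties broken alphabetically for a stable result.
    frequent.sort(key=lambda pair: (-pair[1], pair[0]))
    # Cap the number of terms that end up in the generated query.
    return [term for term, _ in frequent[:max_query_terms]]


print(select_query_terms("red red blue blue blue green"))
```

Note that with the default min_term_freq of 2, a short field such as a tag list where each tag appears once would yield no terms at all, which is presumably why the question sets min_term_freq to 1 for tags.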

The more like this API goes one step further: it lets you provide the id of a document and, again, a list of fields. The content of those fields is extracted from that document and used to run a more like this query on the same fields. In other words, the generated more like this query has its like_text property set to the extracted text. As you can see, the more like this API executes a more like this query under the hood.
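To do by hand what the API does for you, you could fetch the document yourself and build the query from its fields. The sketch below assumes you already have the document's _source as a Python dict; the helper name mlt_query_from_source is hypothetical, and joining field values with spaces is my own simplification of the extraction step.

```python
def mlt_query_from_source(source, fields):
    """Build a more_like_this query body from selected fields of a document."""
    parts = []
    for field in fields:
        value = source.get(field)
        if value is None:
            # The document has no value for this field; skip it.
            continue
        if isinstance(value, (list, tuple)):
            # Multi-valued fields (e.g. a list of tags) are flattened.
            parts.extend(str(v) for v in value)
        else:
            parts.append(str(value))
    return {
        "more_like_this": {
            "fields": list(fields),
            "like_text": " ".join(parts),
        }
    }


doc = {"tag": ["search", "lucene"], "content": "How similarity queries work"}
print(mlt_query_from_source(doc, ["tag"]))
```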

The more like this query gives you more flexibility, since you can combine it with other queries and take the text from whatever source you like. The more like this API, on the other hand, exposes the common functionality and does more of the work for you, but with some restrictions.

In your case I would combine a couple of more like this queries, so that you can make use of the powerful Elasticsearch query DSL, boost queries differently, and so on. The downside is that you have to provide the text yourself, since you can't provide the id of the document to extract it from.

There are different ways to achieve what you want. I would use a bool query to combine the two more like this queries in a should clause and give them different weights. I would also use the more like this field query instead, since you want to query a single field at a time.

{
    "bool" : {
        "must" : {
          "match_all" : { }
        },
        "should" : [
            {
              "more_like_this_field" : {
                "tags" : {
                  "like_text" : "here go the tags extracted from the current document!",
                  "boost" : 2.0
                }
              }
            },
            {
              "more_like_this_field" : {
                "content" : {
                  "like_text" : "here goes the content extracted from the current document!"
                }
              }
            }
        ],
        "minimum_number_should_match" : 1
    }
}

This way at least one of the should clauses must match, and a match on tags is more important than a match on content.
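Since some documents won't have any tags, it may help to build the query in code and leave out a clause when there is no text for it. The following is only a sketch: the function name similar_documents_query and the tag_boost argument are hypothetical, and the field names "tags" and "content" are taken from the query above.

```python
import json


def similar_documents_query(tags_text, content_text, tag_boost=2.0):
    """Build the bool query above, skipping clauses that have no text."""
    should = []
    if tags_text:
        # Matches on tags count more than matches on content.
        should.append({
            "more_like_this_field": {
                "tags": {"like_text": tags_text, "boost": tag_boost}
            }
        })
    if content_text:
        should.append({
            "more_like_this_field": {
                "content": {"like_text": content_text}
            }
        })
    return {
        "bool": {
            "must": {"match_all": {}},
            "should": should,
            "minimum_number_should_match": 1,
        }
    }


print(json.dumps(similar_documents_query("", "text of an untagged document"), indent=2))
```

For a document without tags this produces a single should clause on content, so the "Similar documents" feature still returns results.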

answered Nov 16 '22 07:11

javanna

This is possible now with the new like syntax:

{
    "more_like_this" : {
        "fields" : ["title", "description"],
        "like" : [
        {
            "_index" : "imdb",
            "_type" : "movies",
            "_id" : "1"
        },
        {
            "_index" : "imdb",
            "_type" : "movies",
            "_id" : "2"
        }],
        "min_term_freq" : 1,
        "max_query_terms" : 12
    }
}

See here: https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-mlt-query.html
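Applied to the original question, the like syntax could be combined with a bool query so that tag matches rank above content matches. This is a sketch under assumptions: the function name similar_by_id_query is hypothetical, the field names "tags" and "content" and the boost value are illustrative, and "_type" is omitted here, so add it to the document reference if your Elasticsearch version uses mapping types.

```python
import json


def similar_by_id_query(index, doc_id, tag_boost=2.0):
    """Build a bool query of two more_like_this clauses that reference a document by id."""
    doc_ref = {"_index": index, "_id": doc_id}

    def mlt(field, boost):
        # One more_like_this clause per field, each with its own boost.
        return {
            "more_like_this": {
                "fields": [field],
                "like": [doc_ref],
                "min_term_freq": 1,
                "boost": boost,
            }
        }

    return {
        "bool": {
            "should": [mlt("tags", tag_boost), mlt("content", 1.0)],
            "minimum_should_match": 1,
        }
    }


print(json.dumps(similar_by_id_query("docs", "23452"), indent=2))
```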

answered Nov 16 '22 08:11

Datageek
