Elasticsearch More Like This Query

I'm trying to wrap my mind around how the more like this query works, and I seem to be missing something. I read the documentation, but the ES documentation is often somewhat... lacking.

The goal is to be able to limit results by term frequency, as attempted here.

So I set up a simple index, including term vectors for debugging, then added two simple docs.

DELETE /test_index

PUT /test_index
{
   "settings": {
      "number_of_shards": 1,
      "number_of_replicas": 0
   },
   "mappings": {
      "doc": {
         "properties": {
            "text": {
               "type": "string",
               "term_vector": "yes"
            }
         }
      }
   }
}

PUT /test_index/doc/1
{
    "text": "apple, apple, apple, apple, apple"
}

PUT /test_index/doc/2
{
    "text": "apple, apple"
}

When I look at the termvectors I see what I expect:

GET /test_index/doc/1/_termvector
...
{
   "_index": "test_index",
   "_type": "doc",
   "_id": "1",
   "_version": 1,
   "found": true,
   "term_vectors": {
      "text": {
         "field_statistics": {
            "sum_doc_freq": 2,
            "doc_count": 2,
            "sum_ttf": 7
         },
         "terms": {
            "apple": {
               "term_freq": 5
            }
         }
      }
   }
}

GET /test_index/doc/2/_termvector
{
   "_index": "test_index",
   "_type": "doc",
   "_id": "2",
   "_version": 1,
   "found": true,
   "term_vectors": {
      "text": {
         "field_statistics": {
            "sum_doc_freq": 2,
            "doc_count": 2,
            "sum_ttf": 7
         },
         "terms": {
            "apple": {
               "term_freq": 2
            }
         }
      }
   }
}

When I run the following query with "min_term_freq": 1 I get back both docs:

POST /test_index/_search
{
   "query": {
      "more_like_this": {
         "fields": [
            "text"
         ],
         "like_text": "apple",
         "min_term_freq": 1,
         "percent_terms_to_match": 1,
         "min_doc_freq": 1
      }
   }
}
...
{
   "took": 1,
   "timed_out": false,
   "_shards": {
      "total": 1,
      "successful": 1,
      "failed": 0
   },
   "hits": {
      "total": 2,
      "max_score": 0.5816214,
      "hits": [
         {
            "_index": "test_index",
            "_type": "doc",
            "_id": "1",
            "_score": 0.5816214,
            "_source": {
               "text": "apple, apple, apple, apple, apple"
            }
         },
         {
            "_index": "test_index",
            "_type": "doc",
            "_id": "2",
            "_score": 0.5254995,
            "_source": {
               "text": "apple, apple"
            }
         }
      ]
   }
}

But if I increase "min_term_freq" to 2 (or more) I get nothing, though I would expect both documents to be returned:

POST /test_index/_search
{
   "query": {
      "more_like_this": {
         "fields": [
            "text"
         ],
         "like_text": "apple",
         "min_term_freq": 2,
         "percent_terms_to_match": 1,
         "min_doc_freq": 1
      }
   }
}
...
{
   "took": 1,
   "timed_out": false,
   "_shards": {
      "total": 1,
      "successful": 1,
      "failed": 0
   },
   "hits": {
      "total": 0,
      "max_score": null,
      "hits": []
   }
}

Why? What am I missing?

If I want to set up a query that would only return the document in which "apple" appears 5 times, but not the one in which it appears 2 times, is there a better way?

Here is the code, for convenience:

http://sense.qbox.io/gist/341f9f77a6bd081debdcaa9e367f5a39be9359cc

Sloan Ahrens asked Feb 03 '15 20:02


2 Answers

The min_term_freq and min_doc_freq thresholds are applied to the terms of the input text before the MLT query is run. Since "apple" occurs only once in your input text, it never qualifies as an MLT term when min_term_freq is set to 2. If you change your input to "apple apple", as below, things will work:

POST /test_index/_search
{
   "query": {
      "more_like_this": {
         "fields": [
            "text"
         ],
         "like_text": "apple apple",
         "min_term_freq": 2,
         "percent_terms_to_match": 1,
         "min_doc_freq": 1
      }
   }
}

The same goes for min_doc_freq. "apple" is found in 2 documents, so any min_doc_freq up to 2 will let "apple" from the input text qualify for the MLT operation.
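To make the point about min_term_freq concrete, here is a minimal Python sketch of the filtering step described above. It is a toy model, not Elasticsearch's actual implementation: the function name select_terms_by_input_freq is hypothetical, and the regex tokenizer is a simplified stand-in for the field's analyzer.

```python
import re
from collections import Counter


def select_terms_by_input_freq(like_text, min_term_freq):
    """Toy model: keep terms occurring at least min_term_freq times in the INPUT text."""
    # Simplified stand-in for the analyzer: lowercase and split on non-word characters.
    tokens = re.findall(r"\w+", like_text.lower())
    counts = Counter(tokens)
    # Frequencies are counted in the input text, not in the indexed documents.
    return sorted(term for term, freq in counts.items() if freq >= min_term_freq)


print(select_terms_by_input_freq("apple", 1))        # ['apple']
print(select_terms_by_input_freq("apple", 2))        # []
print(select_terms_by_input_freq("apple apple", 2))  # ['apple']
```

With an empty term list there is nothing left to search for, which matches the zero hits seen in the question when like_text is "apple" and min_term_freq is 2.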

Vineeth Mohan answered Sep 21 '22 02:09


Like the poster of this question, I was trying to wrap my mind around the more_like_this query, too...

I struggled a bit to find good sources of information on the web, but (as in most cases) the documentation helps the most. So here's the link to the documentation, along with some of the more important parameters (some are a bit harder to understand, so I added my interpretation):

max_query_terms - The maximum number of query terms that will be selected (from each input document). Increasing this value gives greater accuracy at the expense of query execution speed. Defaults to 25.

min_term_freq - The minimum term frequency below which the terms will be ignored from the input document. Defaults to 2.

If a term appears in the input document fewer than 2 times (the default), it will be dropped from the input document, i.e. it will not be searched for in other candidate more_like_this documents.

min_doc_freq - The minimum document frequency below which the terms will be ignored from the input document. Defaults to 5.

This one took me a second to get, so, here's my interpretation:

The number of documents in the index in which a term from the input document must appear in order to be selected as a query term.
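The way min_doc_freq and max_query_terms interact with min_term_freq can be sketched in Python as below. This is only an illustration under stated assumptions: select_query_terms and tokenize are hypothetical helpers, the corpus is a plain in-memory list, and terms are ranked by input frequency for simplicity, whereas the real query selects terms by a tf-idf style score.

```python
import re
from collections import Counter


def tokenize(text):
    # Simplified stand-in for the field's analyzer.
    return re.findall(r"\w+", text.lower())


def select_query_terms(input_text, corpus, min_term_freq=2, min_doc_freq=5,
                       max_query_terms=25):
    """Toy model of MLT term selection; defaults mirror the ones quoted above."""
    term_freq = Counter(tokenize(input_text))
    # Document frequency: number of corpus documents containing each term.
    doc_freq = Counter()
    for doc in corpus:
        doc_freq.update(set(tokenize(doc)))
    candidates = [
        term for term, freq in term_freq.items()
        if freq >= min_term_freq and doc_freq[term] >= min_doc_freq
    ]
    # Assumption: rank by input frequency (the real query uses a tf-idf style score).
    candidates.sort(key=lambda term: (-term_freq[term], term))
    return candidates[:max_query_terms]


corpus = ["apple, apple, apple, apple, apple", "apple, apple"]
print(select_query_terms("apple apple", corpus, min_doc_freq=1))  # ['apple']
print(select_query_terms("apple apple", corpus, min_doc_freq=3))  # []
print(select_query_terms("apple apple", corpus))                  # []
```

The last call uses the default min_doc_freq of 5, so on a two-document index no term can qualify. This is why small test indices usually need min_doc_freq lowered explicitly.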

There it is, I hope I saved someone a few minutes of his life. :)

Cheers!

Filip Savic answered Sep 21 '22 02:09