
Elasticsearch: Get phrase frequency in a given document

Test data:

curl -XPUT 'localhost:9200/customer/external/1?pretty' -d '{ "body": "this is a test" }'
curl -XPUT 'localhost:9200/customer/external/2?pretty' -d '{ "body": "and this is another test" }'
curl -XPUT 'localhost:9200/customer/external/3?pretty' -d '{ "body": "this thing is a test" }'

My goal is to get the frequency of a phrase in a document.

I know how to get the frequency of the terms in a document:

curl -g "http://localhost:9200/customer/external/1/_termvectors?pretty" -d'
{
        "fields": ["body"],
        "term_statistics" : true
}'

And I know how to count the documents that contain a given phrase (with a match_phrase or span_near query):

curl -g "http://localhost:9200/customer/_count?pretty" -d'
{
  "query": {
    "match_phrase": {
      "body" : "this is"
    }
  }
}'

How can I access the frequency of a phrase?

Gilles Cuyaubere asked Oct 04 '17 15:10


1 Answer

You can use termvectors. As the documentation says:

Return values

Three types of values can be requested: term information, term statistics and field statistics. By default, all term information and field statistics are returned for all fields but no term statistics.

Term information

term frequency in the field (always returned)
term positions (positions : true)
start and end offsets (offsets : true)
term payloads (payloads : true), as base64 encoded bytes
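
To illustrate the options quoted above, here is a minimal sketch of how a _termvectors request body could be assembled. The helper name build_termvectors_body is hypothetical (it is not part of any Elasticsearch client); it only sets the fields, positions, offsets, payloads and term_statistics options already mentioned in this thread.

```python
import json


def build_termvectors_body(fields, positions=True, offsets=False,
                           payloads=False, term_statistics=False):
    """Hypothetical helper: build the JSON body of a _termvectors request."""
    return {
        "fields": list(fields),
        "positions": positions,          # token positions per term
        "offsets": offsets,              # start and end character offsets
        "payloads": payloads,            # base64 encoded payloads
        "term_statistics": term_statistics,
    }


# Ask for positions on the "body" field used in the question's test data.
body = build_termvectors_body(["body"], positions=True)
print(json.dumps(body, indent=2))
```

The printed JSON could then be passed to curl with -d, in the same way as the requests in the question.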

What you need is the term frequency (term_freq in the response): in the documentation's example you can see the frequency reported for john doe in the document. Be aware that storing term vectors roughly doubles the disk space used by the field they are enabled on.
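
Term vectors report one entry per term rather than per phrase, so one possible way to get a phrase count is to combine the token positions on the client side. The following is only a sketch under a few assumptions: the function name phrase_frequency is made up for illustration, the sample dictionary is hand-written in the shape of the "terms" object of a _termvectors response for "this is a test" (not captured from a real cluster), and the phrase is assumed to be already split into tokens the same way the field's analyzer would split it.

```python
def phrase_frequency(terms, phrase_tokens):
    """Count occurrences of phrase_tokens at consecutive positions.

    terms is the "terms" object of one field in a _termvectors response
    that was requested with positions enabled.
    """
    if not phrase_tokens:
        return 0
    # Collect, for each phrase token, the set of positions where it occurs.
    position_sets = []
    for token in phrase_tokens:
        entry = terms.get(token)
        if entry is None:
            return 0  # a token that never occurs rules out the phrase
        position_sets.append({t["position"] for t in entry["tokens"]})
    # The phrase starts at p when the i-th token sits at position p + i.
    return sum(
        1
        for start in position_sets[0]
        if all(start + i in positions
               for i, positions in enumerate(position_sets))
    )


# Hand-written sample for the document "this is a test" (illustrative only).
sample_terms = {
    "this": {"term_freq": 1, "tokens": [{"position": 0}]},
    "is": {"term_freq": 1, "tokens": [{"position": 1}]},
    "a": {"term_freq": 1, "tokens": [{"position": 2}]},
    "test": {"term_freq": 1, "tokens": [{"position": 3}]},
}

print(phrase_frequency(sample_terms, ["this", "is"]))    # 1
print(phrase_frequency(sample_terms, ["this", "test"]))  # 0
```

This only matches exact, adjacent tokens; anything like slop or synonyms from match_phrase is not reproduced here.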

Lupanoide answered Nov 15 '22 08:11