Test data:
curl -XPUT 'localhost:9200/customer/external/1?pretty' -d '{ "body": "this is a test" }'
curl -XPUT 'localhost:9200/customer/external/2?pretty' -d '{ "body": "and this is another test" }'
curl -XPUT 'localhost:9200/customer/external/3?pretty' -d '{ "body": "this thing is a test" }'
My goal is to get the frequency of a phrase in a document.
I know how to get the frequency of the terms in a document:
curl -g "http://localhost:9200/customer/external/1/_termvectors?pretty" -d'
{
"fields": ["body"],
"term_statistics" : true
}'
And I know how to count the documents that contain a given phrase (with a match_phrase or span_near query):
curl -g "http://localhost:9200/customer/_count?pretty" -d'
{
"query": {
"match_phrase": {
"body" : "this is"
}
}
}'
How can I access the frequency of a phrase?
You can use term vectors. As written in the documentation:
Return values
Three types of values can be requested: term information, term statistics and field statistics. By default, all term information and field statistics are returned for all fields but no term statistics.
Term information
- term frequency in the field (always returned)
- term positions (positions : true)
- start and end offsets (offsets : true)
- term payloads (payloads : true), as base64 encoded bytes
What you need is the term frequency. In the documentation's example, the response contains the frequency of the terms john and doe in the document. Be aware that storing term vectors doubles the disk space used by the field they are enabled on.
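The term frequency above is per term, not per phrase. One possible way to get a phrase count is to use the term positions that the same API returns and count the places where the phrase terms occur at consecutive positions. Below is a minimal Python sketch of that idea, not an official Elasticsearch feature. The function name phrase_frequency and the hand-written sample response are hypothetical; the sample only mimics the "term_vectors" / "terms" / "tokens" / "position" layout of a _termvectors response for the document "this is a test". It also assumes the phrase terms are already in the form the analyzer produced (for example lowercased).

```python
def phrase_frequency(termvectors_response, field, phrase_terms):
    """Count how often phrase_terms occur as consecutive tokens in one document.

    termvectors_response: parsed JSON of a _termvectors call that returned positions.
    phrase_terms: the phrase as a list of already-analyzed terms (assumption).
    """
    if not phrase_terms:
        return 0
    terms = termvectors_response["term_vectors"][field]["terms"]

    # For each phrase term, collect the set of positions where it occurs.
    positions = []
    for term in phrase_terms:
        entry = terms.get(term)
        if entry is None:
            return 0  # a term that is absent means the phrase cannot occur
        positions.append({tok["position"] for tok in entry.get("tokens", [])})

    # The phrase starts at position p if the i-th term occurs at p + i.
    return sum(
        1
        for start in positions[0]
        if all(start + i in positions[i] for i in range(1, len(positions)))
    )


# Hand-written sample shaped like a _termvectors response (hypothetical data).
sample = {
    "term_vectors": {
        "body": {
            "terms": {
                "this": {"term_freq": 1, "tokens": [{"position": 0}]},
                "is": {"term_freq": 1, "tokens": [{"position": 1}]},
                "a": {"term_freq": 1, "tokens": [{"position": 2}]},
                "test": {"term_freq": 1, "tokens": [{"position": 3}]},
            }
        }
    }
}

print(phrase_frequency(sample, "body", ["this", "is"]))
```

To use it on real data, you would pass the parsed JSON of a _termvectors request like the one in the question. Positions are part of the term information returned by default, but the count is only meaningful if the phrase is tokenized the same way as the indexed field.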