I am trying to calculate the total number of times a particular term occurs across an entire index (the term's collection frequency). I tried to do this with term vectors, but that API is restricted to a single document. Even for terms that exist in the specified document, the response seems to max out at a certain doc_count (within field_statistics), which makes me doubt its accuracy.
Request:
http://myip:9200/clinicaltrials/trial/AVmk-ky6XMskTDwIwpih/_termvectors?term_statistics=true
The document id being used here is "AVmk-ky6XMskTDwIwpih", although the term statistics should not be specific to a document.
Response:
This is what I get for the term "cancer" for one of the fields:
"cancer" : {
  "doc_freq" : 5297,
  "ttf" : 10587,
  "term_freq" : 1,
  "tokens" : [
    {
      "position" : 15,
      "start_offset" : 115,
      "end_offset" : 121
    }
  ]
},
If I total the ttf for all fields, I get 18915. However, the actual total term frequency for "cancer" is in fact 542829. This leads me to believe that it is limiting the term_vector stats to a subset of documents within the index.
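For reference, the per-field summation described above could be sketched roughly like this in Python. It assumes the response has already been parsed into a dict with the usual term_vectors -> field -> terms layout. The field names ("brief_summary", "official_title", "keyword") and the second field's numbers are made up for illustration, and total_ttf is a hypothetical helper, not part of any client library.

```python
def total_ttf(response, term):
    """Sum the ttf reported for `term` over every field in a parsed
    _termvectors response. The result is only as complete as the
    statistics that the response itself contains."""
    total = 0
    for field_data in response.get("term_vectors", {}).values():
        stats = field_data.get("terms", {}).get(term)
        if stats is not None:
            total += stats.get("ttf", 0)
    return total


# Hypothetical response fragment: field names and the second field's
# numbers are invented; only the first entry mirrors the post.
sample = {
    "term_vectors": {
        "brief_summary": {
            "terms": {"cancer": {"doc_freq": 5297, "ttf": 10587, "term_freq": 1}}
        },
        "official_title": {
            "terms": {"cancer": {"doc_freq": 2000, "ttf": 2500, "term_freq": 1}}
        },
        "keyword": {
            "terms": {"tumor": {"doc_freq": 10, "ttf": 12, "term_freq": 1}}
        },
    }
}

print(total_ttf(sample, "cancer"))
```

Fields that do not contain the term are simply skipped, so the sum only covers fields where the requested document actually has that term.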
Any advice here would be greatly appreciated.
You can use cURL in a UNIX terminal or Windows command prompt, the Kibana Console UI, or any one of the various low-level clients available to make an API call to get all of the documents in an Elasticsearch index. All of these methods use a variation of the GET request to search the index.
The doc count represents the number of documents in your index, while index_total stands for the number of indexing operations performed during Elasticsearch uptime.
In Elasticsearch, an index (plural: indices) contains a schema and can have one or more shards and replicas. An Elasticsearch index is divided into shards and each shard is an instance of a Lucene index. Indices are used to store the documents in dedicated data structures corresponding to the data type of fields.
The counts differ because term vector statistics are not accurate unless the index in question has a single shard. In an index with multiple shards, the documents are distributed across the shards, and the statistics are retrieved only from the shard that holds the requested document, so the frequency returned is that shard's count rather than the index-wide total.
Thus, the returned frequency is only a relative measure and not the absolute value you expect; see the Behaviour section of the term vectors documentation. To test this, you can create a single-shard index and request the frequency (it should give you the actual total).
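To make the shard effect concrete, here is a toy model in Python. It is not how Elasticsearch routes or counts internally: documents are assigned to shards by position (i % n_shards) purely as an assumption for illustration, and shard_ttf is a hypothetical helper. It only shows that a count taken from one shard is smaller than the total across all shards.

```python
def shard_ttf(docs, term, n_shards):
    """Count occurrences of `term` per shard in a toy index.

    Assumption for illustration only: document i lives on shard
    i % n_shards, and tokens are whitespace-separated lowercase words.
    """
    counts = [0] * n_shards
    for i, text in enumerate(docs):
        counts[i % n_shards] += text.lower().split().count(term)
    return counts


docs = [
    "cancer trial for lung cancer",
    "diabetes study",
    "breast cancer screening",
    "cancer cancer cancer",
    "heart study",
]

per_shard = shard_ttf(docs, "cancer", 3)   # statistics as seen by each shard
single = shard_ttf(docs, "cancer", 1)      # same data in a single-shard index

print(per_shard, sum(per_shard), single)
```

With three shards, no single shard sees every occurrence, while the single-shard index reports the full count. That is the same pattern as the gap between the ttf in the question and the real total.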
I believe you need to set term_statistics to true, as per the Elasticsearch documentation:
Term statistics: setting term_statistics to true (default is false) will return
- total term frequency (how often a term occurs in all documents)
- document frequency (the number of documents containing the current term)
By default these values are not returned since term statistics can have a serious performance impact.
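As a small sketch, the request URL with that flag switched on could be assembled like this in Python. The host, index, type and document id are the ones from the question, and termvectors_url is a hypothetical helper name. Actually sending the request (with urllib, requests, or a client library) is left out.

```python
from urllib.parse import quote, urlencode


def termvectors_url(base, index, doc_type, doc_id, term_statistics=True):
    """Build a _termvectors URL in the style used in the question.

    Hypothetical helper: it only formats the URL, it does not send it.
    """
    path = "/".join(quote(part, safe="") for part in (index, doc_type, doc_id))
    query = urlencode({"term_statistics": str(term_statistics).lower()})
    return f"{base}/{path}/_termvectors?{query}"


url = termvectors_url(
    "http://myip:9200", "clinicaltrials", "trial", "AVmk-ky6XMskTDwIwpih"
)
print(url)
```

Note that the request in the question already passes term_statistics=true, so the flag on its own does not explain the low totals.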