Getting total term frequency throughout entire index (Elasticsearch)

I am trying to calculate the total number of times a particular term occurs throughout an entire index (the term's collection frequency). I tried term vectors, but that API is restricted to a single document. Even for terms that exist in the specified document, the response seems to max out at a certain doc_count (within field_statistics), which makes me doubt its accuracy.

Request:

http://myip:9200/clinicaltrials/trial/AVmk-ky6XMskTDwIwpih/_termvectors?term_statistics=true

The document id being used here is "AVmk-ky6XMskTDwIwpih", although the term statistics should not be specific to a document.

Response:

This is what I get for the term "cancer" for one of the fields:

"cancer" : {
  "doc_freq" : 5297,
  "ttf" : 10587,
  "term_freq" : 1,
  "tokens" : [
    {
      "position" : 15,
      "start_offset" : 115,
      "end_offset" : 121
    }
  ]
},

If I total the ttf for all fields, I get 18915. However, the actual total term frequency for "cancer" is in fact 542829. This leads me to believe that it is limiting the term_vector stats to a subset of documents within the index.
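The per-field totalling described above can be sketched roughly as follows. This is only an illustration: it assumes the usual `_termvectors` response layout (`term_vectors` → field → `terms` → term → `ttf`), the field names `brief_title` and `brief_summary` are hypothetical, and the second field's numbers are invented so that the sum matches the 18915 quoted above.

```python
def total_ttf(response, term):
    """Sum the ttf reported for `term` across every field of a _termvectors response."""
    total = 0
    for field_data in response.get("term_vectors", {}).values():
        stats = field_data.get("terms", {}).get(term)
        if stats is not None:
            total += stats.get("ttf", 0)
    return total


# Hypothetical sample: field names and the second field's values are made up.
sample = {
    "term_vectors": {
        "brief_title": {
            "terms": {"cancer": {"doc_freq": 5297, "ttf": 10587, "term_freq": 1}}
        },
        "brief_summary": {
            "terms": {"cancer": {"doc_freq": 3000, "ttf": 8328, "term_freq": 2}}
        },
    }
}

print(total_ttf(sample, "cancer"))
```

Note that a sum like this only adds up whatever statistics the response contains; it cannot recover counts the API never reported.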

Any advice here would be greatly appreciated.

liamjc asked Jan 18 '17 04:01



2 Answers

The counts differ because term vector statistics are not exact unless the index has a single shard. In a multi-shard index the documents are distributed across the shards, and the statistics are gathered only from the shard that holds the requested document (a randomly selected shard is used only for artificial documents). The frequency returned therefore covers that one shard, not the whole index.

Thus, the returned frequency is only a relative measure, not the absolute value you expect; see the Behaviour section of the term vectors documentation. To test this, create a single-shard index and request the frequency; it should give you the actual total.
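To make the shard effect concrete, here is a toy simulation in plain Python. It does not talk to Elasticsearch: the `shard_for` routing rule, the document ids, and the texts are all invented for illustration, and the real routing hash is different.

```python
from collections import Counter


def shard_for(doc_id, num_shards):
    # Toy stand-in for Elasticsearch's routing hash (an assumption, not the real one).
    return sum(doc_id.encode("utf-8")) % num_shards


def per_shard_ttf(docs, term, num_shards):
    """Count occurrences of `term` separately for each simulated shard."""
    counts = Counter()
    for doc_id, text in docs.items():
        counts[shard_for(doc_id, num_shards)] += text.lower().split().count(term)
    return counts


# Invented documents.
docs = {
    "a": "cancer trial for lung cancer",
    "b": "diabetes trial",
    "c": "breast cancer screening",
    "d": "cancer cancer cancer",
}

two_shards = per_shard_ttf(docs, "cancer", 2)
one_shard = per_shard_ttf(docs, "cancer", 1)
print(dict(two_shards), dict(one_shard))
```

With two simulated shards, any single shard sees only part of the occurrences; with one shard, the shard-level count and the index-wide count coincide, which is why the single-shard test above should report the true total.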

rozduva answered Nov 15 '22 12:11


I believe you need to set term_statistics to true, as per the Elasticsearch documentation:

Term statistics

Setting term_statistics to true (default is false) will return:

- total term frequency (how often a term occurs in all documents)
- document frequency (the number of documents containing the current term)

By default these values are not returned since term statistics can have a serious performance impact.
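As a rough sketch of how such a request might be assembled, the helper below builds the path and JSON body for a term vectors call with `term_statistics` enabled. The path follows the typed URL format used in the question; the function name `termvectors_request` and the field name `brief_summary` are hypothetical, and nothing is actually sent over the network.

```python
import json


def termvectors_request(index, doc_type, doc_id, fields):
    """Build the path and JSON body for a _termvectors call with term statistics on."""
    path = f"/{index}/{doc_type}/{doc_id}/_termvectors"
    body = {
        "fields": fields,
        "term_statistics": True,   # off by default because of the performance cost
        "field_statistics": True,
        "positions": False,        # skip token positions and offsets to keep the response small
        "offsets": False,
    }
    return path, json.dumps(body)


# "brief_summary" is a made-up field name.
path, body = termvectors_request(
    "clinicaltrials", "trial", "AVmk-ky6XMskTDwIwpih", ["brief_summary"]
)
print(path)
print(body)
```

The statistics returned are still computed per shard, so enabling the flag does not by itself yield an index-wide total on a multi-shard index.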

groo answered Nov 15 '22 13:11