
Elasticsearch - How to get popular words list of documents

I have a temporary index with documents that I need to moderate. I want to group these documents by the words they contain.

For example, I have these documents:

1 - "aaa bbb ccc ddd eee fff"

2 - "bbb mmm aaa fff xxx"

3 - "hhh aaa fff"

So, I want to get the most popular words, ideally with counts: "aaa" - 3, "fff" - 3, "bbb" - 2, etc.

Is this possible with elasticsearch?

asked Jan 02 '15 by o139


2 Answers

A simple terms aggregation will meet your needs (where mydata is the name of your field):

curl -XGET 'http://localhost:9200/test/data/_search?search_type=count&pretty' -d '{
  "query": {
    "match_all": {}
  },
  "aggs": {
    "mydata_agg": {
      "terms": { "field": "mydata" }
    }
  }
}'

This will return:

{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 3,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "mydata_agg" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [ {
        "key" : "aaa",
        "doc_count" : 3
      }, {
        "key" : "fff",
        "doc_count" : 3
      }, {
        "key" : "bbb",
        "doc_count" : 2
      }, {
        "key" : "ccc",
        "doc_count" : 1
      }, {
        "key" : "ddd",
        "doc_count" : 1
      }, {
        "key" : "eee",
        "doc_count" : 1
      }, {
        "key" : "hhh",
        "doc_count" : 1
      }, {
        "key" : "mmm",
        "doc_count" : 1
      }, {
        "key" : "xxx",
        "doc_count" : 1
      } ]
    }
  }
}
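Note that `search_type=count` was deprecated in Elasticsearch 2.0 and removed in 5.0; on current versions the equivalent is `"size": 0` in the request body. A sketch of the same request for newer versions (assuming the field is mapped as `keyword`; with the default dynamic mapping you can aggregate on the `mydata.keyword` sub-field instead, since terms aggregations on `text` fields require `fielddata` to be enabled):

```json
{
  "size": 0,
  "aggs": {
    "mydata_agg": {
      "terms": { "field": "mydata" }
    }
  }
}
```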
answered Oct 08 '22 by Olly Cruickshank

This question and the accepted answer are a few years old; there is now a better way.

The accepted answer does not take into account the fact that the most common words are usually uninteresting, e.g. stopwords such as "the", "a", "in", "for" and so on.

This is usually the case for fields that contain data of type text and not keyword.

This is why Elasticsearch has an aggregation for exactly this purpose, called the Significant Text aggregation.
From the docs:

  • It is specifically designed for use on type text fields
  • It does not require field data or doc-values
  • It re-analyzes text content on-the-fly meaning it can also filter duplicate sections of noisy text that otherwise tend to skew statistics.

It can, however, take longer than other kinds of queries, so it is suggested to use it after narrowing the data with a match query, or within an aggregation of type sampler.

So, in your case you would send a query like this (leaving out the filtering/sampling):

{
    "aggs": {
        "keywords": {
            "significant_text": {
                "field": "myfield"
            }
        }
    }
}
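The filtering and sampling mentioned above can be combined with significant_text in a single request. A sketch using a sampler aggregation to restrict the analysis to the top-matching documents (the index/field name `myfield`, the match terms, and the `shard_size` value are placeholders to adapt to your data):

```json
{
  "query": {
    "match": { "myfield": "terms to moderate" }
  },
  "aggs": {
    "sampled": {
      "sampler": { "shard_size": 100 },
      "aggs": {
        "keywords": {
          "significant_text": { "field": "myfield" }
        }
      }
    }
  }
}
```

The sampler keeps only the highest-scoring documents per shard, so the significant_text re-analysis runs over a small, relevant subset rather than the whole index.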
answered Oct 08 '22 by Aron Fiechter