I have a temporary index with documents that I need to moderate. I want to group these documents by the words they contain.
For example, I have these documents:
1 - "aaa bbb ccc ddd eee fff"
2 - "bbb mmm aaa fff xxx"
3 - "hhh aaa fff"
So, I want to get the most popular words, ideally with counts: "aaa" - 3, "fff" - 3, "bbb" - 2, etc.
Is this possible with Elasticsearch?
A simple terms aggregation will meet your needs (where mydata is the name of your field):
curl -XGET 'http://localhost:9200/test/data/_search?search_type=count&pretty' -d '{
    "query": {
        "match_all": {}
    },
    "aggs": {
        "mydata_agg": {
            "terms": { "field": "mydata" }
        }
    }
}'
will return:
{
    "took" : 3,
    "timed_out" : false,
    "_shards" : {
        "total" : 5,
        "successful" : 5,
        "failed" : 0
    },
    "hits" : {
        "total" : 3,
        "max_score" : 0.0,
        "hits" : [ ]
    },
    "aggregations" : {
        "mydata_agg" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 0,
            "buckets" : [
                { "key" : "aaa", "doc_count" : 3 },
                { "key" : "fff", "doc_count" : 3 },
                { "key" : "bbb", "doc_count" : 2 },
                { "key" : "ccc", "doc_count" : 1 },
                { "key" : "ddd", "doc_count" : 1 },
                { "key" : "eee", "doc_count" : 1 },
                { "key" : "hhh", "doc_count" : 1 },
                { "key" : "mmm", "doc_count" : 1 },
                { "key" : "xxx", "doc_count" : 1 }
            ]
        }
    }
}
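A few notes if you are on a more recent Elasticsearch version: search_type=count has been removed and the same effect is achieved by setting "size" to 0, requests with a body need a Content-Type header, and document types no longer appear in the URL. A roughly equivalent request (a sketch, assuming the same index and field names as above) would be:
curl -XGET 'http://localhost:9200/test/_search?pretty' -H 'Content-Type: application/json' -d '{
    "size": 0,
    "aggs": {
        "mydata_agg": {
            "terms": { "field": "mydata" }
        }
    }
}'
Also note that on newer versions a terms aggregation on an analyzed text field only works if fielddata is enabled in the mapping (aggregating on a keyword sub-field would count whole values rather than individual words). A minimal mapping update for that, again assuming the field is called mydata:
curl -XPUT 'http://localhost:9200/test/_mapping?pretty' -H 'Content-Type: application/json' -d '{
    "properties": {
        "mydata": { "type": "text", "fielddata": true }
    }
}'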
It might be because this question and the accepted answer are some years old, but there is now a better way.
The accepted answer does not take into account that the most common words are usually uninteresting, e.g. stopwords such as "the", "a", "in", "for" and so on. This is usually the case for fields that contain data of type text rather than keyword.
This is why Elasticsearch actually has an aggregation specifically for this purpose: the significant_text aggregation.
From the docs: it is designed to be used on text fields. It can, however, take longer than other kinds of queries, so it is suggested to use it after filtering the data with a match query, or inside a previous aggregation of type sampler.
So, in your case you would send a query like this (leaving out the filtering/sampling):
{
    "aggs": {
        "keywords": {
            "significant_text": {
                "field": "myfield"
            }
        }
    }
}
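If you do want to include the filtering and sampling step the docs recommend, a minimal sketch could look like the following (the match query and the shard_size value are just illustrative choices, and myfield again stands in for your own field name):
{
    "query": {
        "match": { "myfield": "aaa" }
    },
    "aggs": {
        "my_sample": {
            "sampler": { "shard_size": 100 },
            "aggs": {
                "keywords": {
                    "significant_text": { "field": "myfield" }
                }
            }
        }
    }
}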