In elasticsearch 6.2 I have a parent-child relationship :
Document -> NamedEntity
I want to aggregate NamedEntity by counting mention field and giving the number of documents that contains each named entity.
My use case is :
doc1 contains 'NER'(_id=ner11), 'NER'(_id=ner12)
doc2 contains 'NER'(_id=ner2)
The parent/child relation is implemented with a join field. In the Document I have a field :
join: {
name: "Document"
}
And in the NamedEntity children :
join: {
name: "NamedEntity",
parent: "parent_id"
}
with _routing set to parent_id.
So I tried with terms sub-aggregation :
curl -XPOST elasticsearch:9200/datashare-testjs/_search?pretty -H 'Content-Type: application/json' -d '
{"query":{"term":{"type":"NamedEntity"}},
"aggs":{
"mentions":{
"terms":{
"field":"mention"
},
"aggs":{
"docs":{
"terms":{"field":"join"}
}
}
}
}
}'
And I have the following response :
"aggregations" : {
"mentions" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "NER",
"doc_count" : 3,
"docs" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "NamedEntity",
"doc_count" : 3 <-- WRONG ! There are 2 distinct documents
}
]
}
}
]
}
I find the expected 3 occurrences in mentions.buckets.doc_count. But in the mentions.buckets.docs.buckets.doc_count field I would like to have only 2 documents (not 3). Like a select count distinct.
If I aggregate with "terms":{"field":"join.parent"} I have :
...
"docs" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [ ]
}
...
I tied with cardinality aggregation on the join field and I obtain a value of 1, and cardinality aggregation on the join.parent that returns a value of 0.
So how do you make an aggregation distinct count on parents without the use of a reverse nested aggregation ?
As @AndreiStefan asked, here is the mapping. It is a simple 1-N relation between Document(content) and NamedEntity(mention) in an ES 6 mapping (fields are defined on the same level) :
curl -XPUT elasticsearch:9200/datashare-testjs -H 'Content-Type: application/json' -d '
{
"mappings": {
"doc": {
"properties": {
"content": {
"type": "text",
"index_options": "offsets"
},
"type": {
"type": "keyword"
},
"join": {
"type": "join",
"relations": {
"Document": "NamedEntity"
}
},
"mention": {
"type": "keyword"
}
}
}
}}
And the requests for a minimal dataset :
curl -XPUT elasticsearch:9200/datashare-testjs/doc/doc1 -H 'Content-Type: application/json' -d '{"type": "Document", "join": {"name": "Document"}, "content": "a NER document contains 2 NER"}'
curl -XPUT elasticsearch:9200/datashare-testjs/doc/doc2 -H 'Content-Type: application/json' -d '{"type": "Document", "join": {"name": "Document"}, "content": "another NER document"}'
curl -XPUT elasticsearch:9200/datashare-testjs/doc/ner11?routing=doc1 -H 'Content-Type: application/json' -d '{"type": "NamedEntity", "join": {"name": "NamedEntity", "parent": "doc1"}, "mention": "NER"}'
curl -XPUT elasticsearch:9200/datashare-testjs/doc/ner12?routing=doc1 -H 'Content-Type: application/json' -d '{"type": "NamedEntity", "join": {"name": "NamedEntity", "parent": "doc1"}, "mention": "NER"}'
curl -XPUT elasticsearch:9200/datashare-testjs/doc/ner2?routing=doc2 -H 'Content-Type: application/json' -d '{"type": "NamedEntity", "join": {"name": "NamedEntity", "parent": "doc2"}, "mention": "NER"}'
"aggs": {
"mentions": {
"terms": {
"field": "mention"
},
"aggs": {
"docs": {
"terms": {
"field": "join"
},
"aggs": {
"uniques": {
"cardinality": {
"field": "join#Document"
}
}
}
}
}
}
}
OR if you just want the count:
"aggs": {
"mentions": {
"terms": {
"field": "mention"
},
"aggs": {
"uniques": {
"cardinality": {
"field": "join#Document"
}
}
}
}
}
If you need a custom ordering (by unique counts):
"aggs": {
"mentions": {
"terms": {
"field": "mention",
"order": {
"uniques": "desc"
}
},
"aggs": {
"uniques": {
"cardinality": {
"field": "join#Document"
}
}
}
}
}
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With