Issue
I'm trying to save a boolean value to elasticsearch, but it is specifically valid for it to be NULL. A sort of don't-care in this case.
There seem to be several options, but it is not completely clear what the best would be.
We are using ElasticSearch version 5.0.2
Option 1
The trivial one would be to save it as a boolean with NULL values. Those would be seen as 'missing' by ES.
PUT my_index
{
"mappings": {
"my_type": {
"properties": {
"my_boolean": { "type": "boolean"}
}
}
}
}
PUT my_index/my_type/1
{"my_boolean": true}
PUT my_index/my_type/2
{"my_boolean": false}
PUT my_index/my_type/3
{"my_boolean": null}
This has several issues, one of them being aggregations. There doesn't
seem to be an easy way to get the values true
, false
and NULL
in
an aggregation.
The missing
feature is known to me, so I know I can do the following:
GET my_index/_search
{
"size":0,
"aggregations": {
"my_boolean": {
"terms": {
"field": "my_boolean"
}
},
"missing_fields": {
"missing" : {
"field": "my_boolean"
}
}
}
}
But this will result in a bucket with 2 values (true/false) and a separate count for the missing documents. That looks like it will cause problems.
Option 2
Another option is to actually give the NULL a value, as described in the manual. The problem is that the value needs to be the correct type, and there is nothing but true and false as a boolean.
The null_value needs to be the same datatype as the field. For instance, a long field cannot have a string null_value.
This means we can use a different type that supports more then 2 values, e.g. integer, but that would be in my head the same as saying: lets map it as integer, and define 1 as true, 2 as false and 3 as null. This would work, but we'd have an implicit mapping that all should know about. (All producers/consumers/whatyamahaveits).
Option 3
A final version would be to try and script our way out of this problem.
GET my_index/_search
{
"size":0,
"aggregations": {
"my_boolean": {
"terms": {
"script" : {
"inline": "if(doc['my_boolean'].length === 1) { if(doc['my_boolean'].value === true){ return 1;} else {return 2;} } else { return 3;}"
}
}
}
}
}
Now we do get the right results in somewhat sane buckets.
"aggregations": {
"my_boolean": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "1",
"doc_count": 1
},
{
"key": "2",
"doc_count": 1
},
{
"key": "3",
"doc_count": 1
}
]
}
}
Note that we still have an implicit mapping with the keys here, so this seems to have some of the same issues that mapping it as an integer has. But still, your datatype is what it should be, so that might be something. Note that we cannot have a bucket with 'null' as key. We can call them "true", "false" and "null" (strings) ofcourse, but this is the same situation, but hidden even more.
Question
what is the best way to deal with this null-problem? (Or maybe we should call it a 'tri-state-boolean-problem'?)
To clarify: we fear that later on a 'non-standard' value might cause problems. The first we saw was the bucketing which we might be able to fix with above script solution, but maybe we run into other issues later. So we are looking for the best-practice of saving this type of data, rather then a quick solution for a specific problem.
You could use the missing
setting of the terms
aggregation (i.e. not a separate missing
aggregation).
That way, you could keep using your boolean field and get your three buckets with 0, 1 and -1 (for null)?
{
"size":0,
"aggregations": {
"my_boolean": {
"terms": {
"field": "my_boolean",
"missing": -1 <--- add this
}
}
}
}
It doesn't have the disadvantage of having to change the field type and encoding it into some other data type (integer / string) and also frees you from leveraging scripting, since that won't scale very well.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With