Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How can we deal with NULL values that have specific meanings?

Issue

I'm trying to save a boolean value to elasticsearch, but it is specifically valid for it to be NULL. A sort of don't-care in this case.

There seem to be several options, but it is not completely clear what the best would be.

We are using ElasticSearch version 5.0.2

Option 1

The trivial one would be to save it as a boolean with NULL values. Those would be seen as 'missing' by ES.

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "my_boolean": { "type": "boolean"}
      }
    }
  }
}

PUT my_index/my_type/1
{"my_boolean": true}

PUT my_index/my_type/2
{"my_boolean": false}

PUT my_index/my_type/3
{"my_boolean": null}

This has several issues, one of them being aggregations. There doesn't seem to be an easy way to get the values true, false and NULL in an aggregation.

The missing feature is known to me, so I know I can do the following:

GET my_index/_search
{
  "size":0,
  "aggregations": {
    "my_boolean": {
      "terms": {
        "field": "my_boolean"
      }
    },
    "missing_fields": {
          "missing" : {
            "field": "my_boolean"
          }
    }
  }
}

But this will result in a bucket with 2 values (true/false) and a separate count for the missing documents. That looks like it will cause problems.

Option 2

Another option is to actually give the NULL a value, as described in the manual. The problem is that the value needs to be the correct type, and there is nothing but true and false as a boolean.

The null_value needs to be the same datatype as the field. For instance, a long field cannot have a string null_value.

This means we can use a different type that supports more then 2 values, e.g. integer, but that would be in my head the same as saying: lets map it as integer, and define 1 as true, 2 as false and 3 as null. This would work, but we'd have an implicit mapping that all should know about. (All producers/consumers/whatyamahaveits).

Option 3

A final version would be to try and script our way out of this problem.

GET my_index/_search
{
  "size":0,
  "aggregations": {

    "my_boolean": {
      "terms": {
       "script" : {
        "inline": "if(doc['my_boolean'].length === 1) { if(doc['my_boolean'].value === true){ return 1;} else {return 2;} } else { return 3;}"
        }
      }
    }
  }
}

Now we do get the right results in somewhat sane buckets.

"aggregations": {
"my_boolean": {
  "doc_count_error_upper_bound": 0,
  "sum_other_doc_count": 0,
  "buckets": [
    {
      "key": "1",
      "doc_count": 1
    },
    {
      "key": "2",
      "doc_count": 1
    },
    {
      "key": "3",
      "doc_count": 1
    }
  ]
}
}

Note that we still have an implicit mapping with the keys here, so this seems to have some of the same issues that mapping it as an integer has. But still, your datatype is what it should be, so that might be something. Note that we cannot have a bucket with 'null' as key. We can call them "true", "false" and "null" (strings) ofcourse, but this is the same situation, but hidden even more.

Question

what is the best way to deal with this null-problem? (Or maybe we should call it a 'tri-state-boolean-problem'?)

To clarify: we fear that later on a 'non-standard' value might cause problems. The first we saw was the bucketing which we might be able to fix with above script solution, but maybe we run into other issues later. So we are looking for the best-practice of saving this type of data, rather then a quick solution for a specific problem.

like image 711
Nanne Avatar asked Dec 08 '16 11:12

Nanne


1 Answers

You could use the missing setting of the terms aggregation (i.e. not a separate missing aggregation).

That way, you could keep using your boolean field and get your three buckets with 0, 1 and -1 (for null)?

{
  "size":0,
  "aggregations": {
    "my_boolean": {
      "terms": {
        "field": "my_boolean",
        "missing": -1                 <--- add this
      }
    }
  }
}

It doesn't have the disadvantage of having to change the field type and encoding it into some other data type (integer / string) and also frees you from leveraging scripting, since that won't scale very well.

like image 139
Val Avatar answered Oct 22 '22 07:10

Val