Elasticsearch 5.2.2: terms aggregation case insensitive

I am trying to run a case-insensitive terms aggregation on a keyword-type field, but I can't get it to work.

What I have tried so far is adding a custom analyzer called "lowercase" that uses the "keyword" tokenizer and the "lowercase" filter. I then added a sub-field called "use_lowercase" to the mapping of the field I want to aggregate on. I also kept the existing "text" and "keyword" components, since I may want to search for terms within the field.

Here is the index definition, including the custom analyzer:

PUT authors
{
  "settings": {
    "analysis": {
      "analyzer": {
        "lowercase": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": "lowercase"
        }
      }
    }
  },
  "mappings": {
    "famousbooks": {
      "properties": {
        "Author": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            },
            "use_lowercase": {
              "type": "text",
              "analyzer": "lowercase"
            }
          }
        }
      }
    }
  }
}

Now I add two records with the same Author, differing only in case:

POST authors/famousbooks/1
{
  "Book": "The Mysterious Affair at Styles",
  "Year": 1920,
  "Price": 5.92,
  "Genre": "Crime Novel",
  "Author": "Agatha Christie"
}

POST authors/famousbooks/2
{
  "Book": "And Then There Were None",
  "Year": 1939,
  "Price": 6.99,
  "Genre": "Mystery Novel",
  "Author": "Agatha christie"
}

So far so good. Now if I run a terms aggregation on Author.use_lowercase:

GET authors/famousbooks/_search
{
  "size": 0,
  "aggs": {
    "authors-aggs": {
      "terms": {
        "field": "Author.use_lowercase"
      }
    }
  }
}

I get the following result:

{
  "error": {
    "root_cause": [
      {
        "type": "illegal_argument_exception",
        "reason": "Fielddata is disabled on text fields by default. Set fielddata=true on [Author.use_lowercase] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory."
      }
    ],
    "type": "search_phase_execution_exception",
    "reason": "all shards failed",
    "phase": "query",
    "grouped": true,
    "failed_shards": [
      {
        "shard": 0,
        "index": "authors",
        "node": "yxcoq_eKRL2r6JGDkshjxg",
        "reason": {
          "type": "illegal_argument_exception",
          "reason": "Fielddata is disabled on text fields by default. Set fielddata=true on [Author.use_lowercase] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory."
        }
      }
    ],
    "caused_by": {
      "type": "illegal_argument_exception",
      "reason": "Fielddata is disabled on text fields by default. Set fielddata=true on [Author.use_lowercase] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory."
    }
  },
  "status": 400
}

So it seems that the aggregation treats the field as text rather than keyword, hence the fielddata error. I would have thought ES would be sophisticated enough to recognize that the field is effectively a keyword (the custom analyzer emits a single token) and therefore aggregatable, but that doesn't appear to be the case.

If I add "fielddata":true to the mapping for Author, the aggregation then works fine, but I'm hesitant to do this given the dire warnings of high heap usage when setting this value.

Is there a best practice for this kind of case-insensitive keyword aggregation? I was hoping I could just specify "type":"keyword", "filter":"lowercase" in the mappings section, but that does not seem to be available.

It feels like I'm having to use too big of a stick to get this to work if I go the "fielddata":true route. Any help on this would be appreciated!

asked Feb 28 '17 19:02 by GoodEnuf


1 Answer

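It turns out the solution is to use a custom normalizer instead of a custom analyzer. A normalizer is applied to a keyword field (the "normalizer" mapping parameter is available since Elasticsearch 5.2), so the sub-field stays a keyword and can be aggregated on without enabling fielddata: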

PUT authors
{
  "settings": {
    "analysis": {
      "normalizer": {
        "myLowercase": {
          "type": "custom",
          "filter": [ "lowercase" ]
        }
      }
    }
  },
  "mappings": {
    "famousbooks": {
      "properties": {
        "Author": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            },
            "use_lowercase": {
              "type": "keyword",
              "normalizer": "myLowercase",
              "ignore_above": 256
            }
          }
        }
      }
    }
  }
}

This allows a terms aggregation on Author.use_lowercase without issue: both documents fall into a single "agatha christie" bucket.
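For intuition only, here is a minimal Python sketch of what the lowercase normalizer plus a terms aggregation amounts to for the two sample documents. It does not talk to Elasticsearch. The helper names lowercase_normalizer and terms_buckets are made up for illustration, and Python's str.lower() is only assumed to be a close stand-in for the "lowercase" token filter.

```python
from collections import Counter

def lowercase_normalizer(value):
    # Stand-in for a normalizer whose only filter is "lowercase":
    # the whole value stays a single term and is lowercased.
    return value.lower()

def terms_buckets(values, normalizer=None):
    # Count documents per term, most frequent first,
    # roughly what a terms aggregation reports as buckets.
    terms = [normalizer(v) if normalizer else v for v in values]
    return Counter(terms).most_common()

# The Author values of the two documents indexed in the question.
authors = ["Agatha Christie", "Agatha christie"]

raw_buckets = terms_buckets(authors)                               # like Author.keyword
normalized_buckets = terms_buckets(authors, lowercase_normalizer)  # like Author.use_lowercase

print(raw_buckets)         # [('Agatha Christie', 1), ('Agatha christie', 1)]
print(normalized_buckets)  # [('agatha christie', 2)]
```

The unnormalized values produce two buckets of one document each, while the normalized values collapse into one bucket of two, which is the difference between aggregating on Author.keyword and on Author.use_lowercase.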

answered Oct 06 '22 12:10 by GoodEnuf