Ignore leading zeros with Elasticsearch

I am trying to create a search bar where the most common query will be for a "serviceOrderNo". "serviceOrderNo" is not a number field in the database, it is a string field. Examples:

000000007
000000002
WO0000042
123456789
AllTextss
000000054
000000065
000000874

The most common format is just an integer preceded by some number of zeros.

How do I set up Elasticsearch so that searching for "65" will match "000000065"? I also want to give precedence to the "serviceOrderNo" field (which I already have working). Here is where I am at right now:

{
   "query": {
      "multi_match": {
         "query": "65",
         "fields": ["serviceOrderNo^2", "_all"]
      }
   }
}
Josh Graham asked Jun 04 '15 17:06


1 Answer

One way of doing this is with a Lucene-flavoured regular expression query:

https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-regexp-query.html

{
   "query": {
      "regexp": {
         "serviceOrderNo": "[0]*65"
      }
   }
}

The Query String query also accepts Lucene regular expressions, alongside its own smaller set of special characters such as wildcards. A regular expression has to be wrapped in forward slashes; without them, * is treated as a wildcard that matches any sequence of characters, so "0*65" would also match a value such as "012365". The query would look like this: https://www.elastic.co/guide/en/elasticsearch/reference/1.x/query-dsl-query-string-query.html

{
   "query": {
      "query_string": {
         "default_field": "serviceOrderNo",
         "query": "/0*65/"
      }
   }
}

These are fairly simple regular expressions. Both say: match the character in the brackets ([0]), or the bare character 0, any number of times (*), followed by 65.
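To see which of the sample values from the question such a pattern would select, here is a minimal Python sketch. It is only an approximation: it uses Python's re module rather than Lucene's regex engine, and the helper name matches_padded is made up for illustration. Lucene regexps always have to match the whole term, so re.fullmatch is the closest analogue.

```python
import re

# Sample serviceOrderNo values taken from the question.
SAMPLES = [
    "000000007", "000000002", "WO0000042", "123456789",
    "AllTextss", "000000054", "000000065", "000000874",
]

def matches_padded(number, values):
    """Return the values equal to `number` preceded by any count of zeros.

    Hypothetical helper: it mimics the anchored behaviour of a Lucene
    regexp such as 0*65 by requiring the pattern to match the full value.
    """
    pattern = re.compile("0*" + re.escape(number))
    return [v for v in values if pattern.fullmatch(v)]

print(matches_padded("65", SAMPLES))  # ['000000065']
print(matches_padded("42", SAMPLES))  # [] because "WO0000042" has a non-zero prefix
```

Note that "WO0000042" is not found by searching for "42", since the pattern only allows zeros in front of the number.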

If you can reindex, or haven't indexed your data yet, you can make this easier on yourself by writing a custom analyzer. Right now, you are using the default analyzer for strings on your serviceOrderNo field. When you index "serviceOrderNo": "000000065", ES stores it as the single term 000000065.

Your custom analyzer could tokenize this field into both "000000065" and "65", using a regular expression. The benefit is that the regex runs only once, at index time, instead of every time you run a query, because ES indexes both "000000065" and "65" and can match a search against either one.
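The idea can be sketched in plain Python. This is a rough emulation, not Elasticsearch itself: trim_zero_tokens and build_index are hypothetical helpers, the pattern ^0*([0-9]*$) is one possible choice for stripping leading zeros, and each value is assumed to arrive as a single token. The sketch keeps the original token and adds the captured group when it is non-empty and differs from the original.

```python
import re
from collections import defaultdict

# Assumed pattern: skip leading zeros, capture the remaining digits.
TRIM_ZERO = re.compile(r"^0*([0-9]*$)")

def trim_zero_tokens(token):
    """Emit the original token plus its zero-stripped form, if any."""
    tokens = [token]
    match = TRIM_ZERO.search(token)
    if match:
        captured = match.group(1)
        # Skip empty captures and captures identical to the original.
        if captured and captured != token:
            tokens.append(captured)
    return tokens

def build_index(values):
    """Map every emitted token to the set of original values containing it."""
    index = defaultdict(set)
    for value in values:
        for token in trim_zero_tokens(value):
            index[token].add(value)
    return index

SAMPLES = [
    "000000007", "000000002", "WO0000042", "123456789",
    "AllTextss", "000000054", "000000065", "000000874",
]
index = build_index(SAMPLES)

print(trim_zero_tokens("000000065"))  # ['000000065', '65']
print(sorted(index["65"]))            # ['000000065']
```

Because both tokens are stored at index time, a search for "65" becomes a plain term lookup and no regular expression has to run at query time.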

You can also check out the ES website documentation on Analyzers.

{
    "settings": {
        "analysis": {
            "filter": {
                "trimZero": {
                    "type": "pattern_capture",
                    "patterns": ["^0*([0-9]*$)"]
                }
            },
            "analyzer": {
                "serviceOrderNo": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["trimZero"]
                }
            }
        }
    },
    "mappings": {
        "serviceorderdto": {
            "properties": {
                "serviceOrderNo": {
                    "type": "string",
                    "analyzer": "serviceOrderNo"
                }
            }
        }
    }
}
IanGabes answered Oct 24 '22 06:10