
ElasticSearch and Regex queries

I am trying to query for documents that have dates within the body of the "content" field.

curl -XGET 'http://localhost:9200/index/_search' -d '{
    "query": {
        "regexp": {
            "content": "^(0[1-9]|[12][0-9]|3[01])[- /.](0[1-9]|1[012])[- /.]((19|20)\\d\\d)$"
        }
    }
}'

Getting closer maybe?

curl -XGET 'http://localhost:9200/index/_search' -d '{
    "filtered": {
        "query": {
            "match_all": {}
        },
        "filter": {
            "regexp": {
                "content": "^(0[1-9]|[12][0-9]|3[01])[- /.](0[1-9]|1[012])[- /.]((19|20)\\d\\d)$"
            }
        }
    }
}'

My regex seems to have been off. This one has been validated on regex101.com. The following query still returns nothing from the 175k documents I have.

curl -XPOST 'http://localhost:9200/index/_search?pretty=true' -d '{
    "query": {
        "regexp": {
            "content": "/[0-9]{4}-[0-9]{2}-[0-9]{2}|[0-9]{2}-[0-9]{2}-[0-9]{4}|[0-9]{2}/[0-9]{2}/[0-9]{4}|[0-9]{4}/[0-9]{2}/[0-9]{2}/g"
        }
    }
}'

I am starting to think that my index might not be set up for such a query. What type of field do you have to use to be able to use regular expressions?

mappings: {
    doc: {
        properties: {
            content: {
                type: string
            }
            title: {
                type: string
            }
            host: {
                type: string
            }
            cache: {
                type: string
            }
            segment: {
                type: string
            }
            query: {
                properties: {
                    match_all: {
                        type: object
                    }
                }
            }
            digest: {
                type: string
            }
            boost: {
                type: string
            }
            tstamp: {
                format: dateOptionalTime
                type: date
            }
            url: {
                type: string
            }
            fields: {
                type: string
            }
            anchor: {
                type: string
            }
        }
    }
}

I want to find every record that contains a date and graph the volume of documents by date. Step 1 is to get this query working. Step 2 will be to pull the dates out and group the documents by them. Can someone suggest a way to get the first part working? I know the second part will be really tricky.

Thanks!

aeupinhere asked Aug 14 '14 16:08


1 Answer

You should read Elasticsearch's Regexp Query documentation carefully; you are making some incorrect assumptions about how the regexp query works.

Probably the most important thing to understand here is what your regex is actually matched against. It is matched against individual terms (tokens), not the entire field value. If the field is being indexed with StandardAnalyzer, as I would suspect, your dates will be split into multiple terms:

  • "01/01/1901" becomes tokens "01", "01" and "1901"
  • "01 01 1901" becomes tokens "01", "01" and "1901"
  • "01-01-1901" becomes tokens "01", "01" and "1901"
  • "01.01.1901" actually will be a single token: "01.01.1901" (Due to decimal handling, see UAX #29)

You can only match a single, whole token with a regexp query.
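
To see why the original pattern finds nothing, here is a minimal Python sketch of the idea. It is only an illustration under loud assumptions: `TOKEN_RE` is a crude stand-in for the standard analyzer (not the real UAX #29 rules), `tokenize` and `matching_terms` are hypothetical helpers, and Python's `re.fullmatch` stands in for the whole-term matching that the regexp query does.

```python
import re

# Crude stand-in for the standard analyzer (an assumption, not the real
# UAX #29 implementation): digit runs joined by "." or "," stay together,
# everything else is split on non-alphanumeric characters.
TOKEN_RE = re.compile(r"[0-9]+(?:[.,][0-9]+)+|[A-Za-z0-9]+")

# The date pattern from the question, without anchors and without \d.
DATE_RE = r"(0[1-9]|[12][0-9]|3[01])[- /.](0[1-9]|1[012])[- /.]((19|20)[0-9]{2})"


def tokenize(text):
    """Split text into lowercase tokens, roughly as the analyzer would."""
    return [token.lower() for token in TOKEN_RE.findall(text)]


def matching_terms(text, pattern):
    """Return the tokens that the pattern matches in full."""
    return [token for token in tokenize(text) if re.fullmatch(pattern, token)]


for sample in ["01/01/1901", "01 01 1901", "01-01-1901", "01.01.1901"]:
    print(sample, tokenize(sample), matching_terms(sample, DATE_RE))
```

Only the dotted form survives as one token, so it is the only sample for which the full date pattern has anything to match. The other three are split into day, month and year tokens, and no single token looks like a complete date.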

Elasticsearch (and Lucene) don't support full Perl-compatible regex syntax.

In your first couple of examples, you are using anchors, ^ and $. These are not supported. Your regex must match the entire token to get a match anyway, so anchors are not needed.

Shorthand character classes like \d (or \\d) are also not supported. Instead of \\d\\d, use [0-9]{2}.

In your last attempt, you are using /{regex}/g, which is also not supported. Since your regex needs to match the whole string, the global flag wouldn't even make sense in context. Unless you are using a query parser which uses them to denote a regex, your regex should not be wrapped in slashes.
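
The three fixes above (no anchors, no \d, no /.../g wrapper) can be sketched as a small rewrite step. `to_lucene_regexp` is a hypothetical helper written for this answer, not part of Elasticsearch or any client library, and it is deliberately naive: it only handles the constructs that appear in the question.

```python
import re


def to_lucene_regexp(pattern):
    """Naively rewrite a PCRE-style pattern for a regexp query (hypothetical helper)."""
    # Strip a /.../flags wrapper such as "/.../g".
    wrapped = re.fullmatch(r"/(.*)/[a-z]*", pattern, flags=re.DOTALL)
    if wrapped:
        pattern = wrapped.group(1)
    # Drop a leading ^ and a trailing unescaped $; the match is whole-term anyway.
    if pattern.startswith("^"):
        pattern = pattern[1:]
    if pattern.endswith("$") and not pattern.endswith("\\$"):
        pattern = pattern[:-1]
    # Spell out the \d shorthand. Naive: this would mangle \d inside [...]
    # or a literal backslash followed by "d".
    return pattern.replace("\\d", "[0-9]")


first_attempt = r"^(0[1-9]|[12][0-9]|3[01])[- /.](0[1-9]|1[012])[- /.]((19|20)\d\d)$"
print(to_lucene_regexp(first_attempt))
print(to_lucene_regexp("/[0-9]{4}-[0-9]{2}-[0-9]{2}/g"))
```

Keep in mind that a cleaned-up pattern still has to match a single token, so on an analyzed field this alone will not find dates written with slashes, dashes or spaces.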

(By the way: How did this one validate on regex101? You have a bunch of unescaped /s. It complains at me when I try it.)


To support this sort of query on such an analyzed field, you'll probably want to look at span queries, particularly Span Multiterm and Span Near. Perhaps something like:

{
    "span_near" : {
        "clauses" : [
            { "span_multi" : { 
                "match": {
                    "regexp": {"content": "0[1-9]|[12][0-9]|3[01]"}
                }
            }},
            { "span_multi" : { 
                "match": {
                    "regexp": {"content": "0[1-9]|1[012]"}
                }
            }},
            { "span_multi" : { 
                "match": {
                    "regexp": {"content": "(19|20)[0-9]{2}"} 
                }
            }}
        ],
        "slop" : 0,
        "in_order" : true
    }
}
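
If you need several variants of this (day-first, year-first and so on), it may be easier to generate the body than to write it by hand. A minimal sketch, assuming a hypothetical helper `span_near_regexps` that emits one span_multi/regexp clause per token; wrapping the result in "query" to form a search request body is also my assumption:

```python
import json


def span_near_regexps(field, token_patterns, slop=0, in_order=True):
    """Build a span_near body with one span_multi/regexp clause per token (hypothetical helper)."""
    clauses = [
        {"span_multi": {"match": {"regexp": {field: pattern}}}}
        for pattern in token_patterns
    ]
    return {"span_near": {"clauses": clauses, "slop": slop, "in_order": in_order}}


# Day, month, year: the same per-token patterns as in the query above.
date_query = span_near_regexps(
    "content",
    ["0[1-9]|[12][0-9]|3[01]", "0[1-9]|1[012]", "(19|20)[0-9]{2}"],
)

# Wrapped in "query" so it can be sent as the body of a search request.
print(json.dumps({"query": date_query}, indent=4))
```

Passing the patterns in a different order, for example year first, gives the year-first variant. Dates written with dots stay a single token, so they would still need an ordinary regexp clause of their own.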
femtoRgon answered Sep 22 '22 14:09