Elasticsearch - Analyser creating the right tokens but query is not matching

I'm trying to make Elasticsearch ignore hyphens; I don't want it to split either side of a hyphen into separate words. It seems simple, but I'm banging my head against the wall.

I want the string "Roland JD-Xi" to produce the following terms: [ roland jd-xi, roland, jd-xi, jdxi, roland jdxi ]

I haven't been able to achieve this easily. Most people will just type 'jdxi', so my initial thought was to simply remove the hyphen. So I'm using the following definition:

"name": {
    "type": "string",
    "analyzer": "language",
    "include_in_all": true,
    "boost": 5,
    "fields": {
        "my_standard": {
            "type": "string",
            "analyzer": "my_standard"
        },
        "my_prefix": {
            "type": "string",
            "analyzer": "my_text_prefix",
            "search_analyzer": "my_standard"
        },
        "my_suffix": {
            "type": "string",
            "analyzer": "my_text_suffix",
            "search_analyzer": "my_standard"
        }
    }
}

And the relevant analysers and filters are defined as follows:

{
    "number_of_replicas": 0,
    "number_of_shards": 1,
    "analysis": {
        "analyzer": {
            "std": {
                "tokenizer": "standard",
                "char_filter": "html_strip",
                "filter": ["standard", "elision", "asciifolding", "lowercase", "stop", "length", "strip_hyphens"]
            },
            ...
            "my_text_prefix": {
                "tokenizer": "whitespace",
                "char_filter": "my_filter",
                "filter": ["standard", "elision", "asciifolding", "lowercase", "stop", "edge_ngram_front"]
            },
            "my_text_suffix": {
                "tokenizer": "whitespace",
                "char_filter": "my_filter",
                "filter": ["standard", "elision", "asciifolding", "lowercase", "stop", "edge_ngram_back"]
            },
            "my_standard": {
                "type": "custom",
                "tokenizer": "whitespace",
                "char_filter": "my_filter",
                "filter": ["standard", "elision", "asciifolding", "lowercase"]
            }
        },
        "char_filter": {
            "my_filter": {
                "type": "mapping",
                "mappings": ["- => ", ". => "]
            }
        },
        "filter": {
            "edge_ngram_front": {
                "type": "edgeNGram",
                "min_gram": 1,
                "max_gram": 20,
                "side": "front"
            },
            "edge_ngram_back": {
                "type": "edgeNGram",
                "min_gram": 1,
                "max_gram": 20,
                "side": "back"
            },
            "strip_spaces": {
                "type": "pattern_replace",
                "pattern": "\\s",
                "replacement": ""
            },
            "strip_dots": {
                "type": "pattern_replace",
                "pattern": "\\.",
                "replacement": ""
            },
            "strip_hyphens": {
                "type": "pattern_replace",
                "pattern": "-",
                "replacement": ""
            },
            "stop": {
                "type": "stop",
                "stopwords": "_none_"
            },
            "length": {
                "type": "length",
                "min": 1
            }
        }
    }
}

I've been able to test this (i.e. with _analyze), and the string "Roland JD-Xi" is tokenised as [ roland, jdxi ].
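For reference, the check was along these lines (the index name my_index here is just a placeholder):

GET my_index/_analyze
{
    "analyzer": "my_standard",
    "text": "Roland JD-Xi"
}

The my_filter char filter strips the hyphen before the whitespace tokenizer runs, which is why "JD-Xi" comes out as the single token jdxi.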

It's not exactly what I want, but it's close enough, as it should match 'jdxi'.

But that's my problem. If I do a simple "index/_search?q=jdxi", it doesn't bring back the document. However, if I do "index/_search?q=roland+jdxi", it does bring back the document.

So at least I know the hyphens are being removed. But if the tokens "roland" and "jdxi" are being created, how come "index/_search?q=jdxi" doesn't match the document?
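One way to see what Lucene query a bare URI search actually turns into (again with my_index as a placeholder name) is the validate API with explain enabled:

GET my_index/_validate/query?q=jdxi&explain=true

The explanation in the response shows which field the unqualified q=jdxi is rewritten against.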

  1. Is my problem with the index process or the query process?
  2. How do I fix it?
  3. Can anyone explain how to achieve the desired tokens [ roland jd-xi, roland, jd-xi, jdxi, roland jdxi ]?
asked Mar 21 '18 by user2023210


1 Answer

I've reproduced your case on ES 6, and searching for index/_search?q=jdxi returns the document.

The issue could be that when you search with index/_search?q=jdxi without specifying a field, it searches the _all field, which contains whatever was in the name field (essentially the same as index/_search?q=name:jdxi). Since that field was not analyzed with your my_standard analyzer, you don't get any results.

What you should do instead is search using the my_standard sub-field, i.e. index/_search?q=name.my_standard:jdxi, and I'm pretty sure you'll get the document.
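In query DSL form, that search is roughly the following (a match query on the sub-field from the mapping above):

GET index/_search
{
    "query": {
        "match": {
            "name.my_standard": "jdxi"
        }
    }
}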

answered Sep 28 '22 by Val
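As for the exact token list in question 3, the mapping above won't produce it. One possible direction, sketched from standard Elasticsearch building blocks (the split_on_hyphen, word_pairs and hyphen_aware names are made up, and this isn't tested against the index above): a word_delimiter filter with preserve_original and catenate_words keeps the original jd-xi, its parts, and the concatenated jdxi, and a shingle filter on top adds the two-word combinations:

{
    "analysis": {
        "filter": {
            "split_on_hyphen": {
                "type": "word_delimiter",
                "preserve_original": true,
                "catenate_words": true
            },
            "word_pairs": {
                "type": "shingle",
                "min_shingle_size": 2,
                "max_shingle_size": 2,
                "output_unigrams": true
            }
        },
        "analyzer": {
            "hyphen_aware": {
                "type": "custom",
                "tokenizer": "whitespace",
                "filter": ["split_on_hyphen", "lowercase", "word_pairs"]
            }
        }
    }
}

For "Roland JD-Xi" this should emit the requested roland, jd-xi, jdxi, roland jd-xi and roland jdxi, but also extra combinations such as jd, xi and jd xi, so it gets close to, rather than exactly, the list in the question.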