Elasticsearch 1.6
I want to index text that contains hyphens, for example "U-12", "U-17", "WU-12", "t-shirt", and to be able to search it with a "Simple Query String" query.
Data sample (simplified):
{
  "title": "U-12 Soccer",
  "comment": "the t-shirts are dirty"
}
As there are already quite a few questions about hyphens, I first tried an existing solution:
Use a char filter: ElasticSearch - Searching with hyphens in name.
So I went for this mapping:
{
  "settings": {
    "analysis": {
      "char_filter": {
        "myHyphenRemoval": {
          "type": "mapping",
          "mappings": [
            "-=>"
          ]
        }
      },
      "analyzer": {
        "default": {
          "type": "custom",
          "char_filter": [ "myHyphenRemoval" ],
          "tokenizer": "standard",
          "filter": [
            "standard",
            "lowercase"
          ]
        }
      }
    }
  },
  "mappings": {
    "test": {
      "properties": {
        "title": {
          "type": "string"
        },
        "comment": {
          "type": "string"
        }
      }
    }
  }
}
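Conceptually, the analyzer above first strips hyphens with the char filter, then tokenizes and lowercases. A minimal Python sketch of that pipeline (my own simulation for illustration, not Elasticsearch code):

```python
import re

def analyze(text):
    """Rough simulation of the custom analyzer above:
    the char filter "-=>" removes hyphens, then a crude stand-in
    for the standard tokenizer splits on word characters, and the
    lowercase filter normalizes the tokens."""
    filtered = text.replace("-", "")        # char filter: "-=>"
    tokens = re.findall(r"\w+", filtered)   # simplified standard tokenizer
    return [t.lower() for t in tokens]      # lowercase token filter

print(analyze("U-12 Soccer"))             # ['u12', 'soccer']
print(analyze("the t-shirts are dirty"))  # ['the', 'tshirts', 'are', 'dirty']
```

So with this mapping, "U-12" is stored in the index as the single token u12.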
Searching is done with the following query:
{
  "_source": true,
  "query": {
    "simple_query_string": {
      "query": "<Text>",
      "default_operator": "AND"
    }
  }
}
What works:
"U-12", "U*", "t*", "ts*"
What didn't work:
"U-*", "u-1*", "t-*", "t-sh*", ...
So it seems the char filter is not executed on the search string? What can I do to make this work?
If anyone is still looking for a simple workaround to this issue: replace hyphens with underscores (_) when indexing data.
For example, O-000022334 should be indexed as O_000022334.
Apply the same replacement to the search string, and replace the underscores back with hyphens when displaying results. This way you can search for "O-000022334" and it will find the correct match.
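A sketch of that workaround in Python (the function names are my own; note it only round-trips cleanly if the original values never contain underscores themselves):

```python
def to_indexed(value: str) -> str:
    """Before indexing (and before searching): replace hyphens with
    underscores so the standard tokenizer keeps the identifier as one token."""
    return value.replace("-", "_")

def to_display(value: str) -> str:
    """When rendering results: restore the original hyphens.
    Assumes the source data contained no underscores of its own."""
    return value.replace("_", "-")

stored = to_indexed("O-000022334")  # 'O_000022334' is what gets indexed
print(stored)
print(to_display(stored))           # 'O-000022334' is what the user sees
```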
The answer is really simple:
Quote from Igor Motov: Configuring the standard tokenizer
By default the simple_query_string query doesn't analyze the words with wildcards. As a result it searches for all tokens that start with i-ma. The word i-mac doesn't match this request because during analysis it's split into two tokens i and mac and neither of these tokens starts with i-ma. In order to make this query find i-mac you need to make it analyze wildcards:
{
  "_source": true,
  "query": {
    "simple_query_string": {
      "query": "u-1*",
      "analyze_wildcard": true,
      "default_operator": "AND"
    }
  }
}
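To see why the flag matters: without analyze_wildcard, the stem of "u-1*" is matched verbatim against the indexed tokens; with it, the stem is run through the analyzer first, so it lines up with what the index actually contains. A rough Python illustration (my own simulation of the prefix matching, not Elasticsearch internals):

```python
import re

def analyze(text):
    """Same chain as the mapping above: strip hyphens (char filter),
    tokenize on word characters, lowercase."""
    return [t.lower() for t in re.findall(r"\w+", text.replace("-", ""))]

tokens = analyze("U-12 Soccer")  # indexed tokens: ['u12', 'soccer']

def prefix_search(tokens, query, analyze_wildcard=False):
    """query ends in '*'; compare its stem against the indexed tokens."""
    stem = query.rstrip("*")
    if analyze_wildcard:
        # run the stem through the same analysis chain: 'u-1' -> 'u1'
        stem = stem.replace("-", "").lower()
    return any(t.startswith(stem) for t in tokens)

print(prefix_search(tokens, "u-1*"))                         # False: no token starts with 'u-1'
print(prefix_search(tokens, "u-1*", analyze_wildcard=True))  # True: 'u12' starts with 'u1'
```

This matches the observed behavior in the question: "u*" works either way, but "u-1*" only matches once the wildcard stem is analyzed.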