 

Matching with missing spaces in ElasticSearch

I have documents that I want to index in ElasticSearch that contain a text field called name. I currently index the name using the snowball analyzer. However, I would like to match names both with and without included spaces. For example, a document with the name "The Home Depot" should match "homedepot", "home", and "home depot". Likewise, a document with a single-word name like "ExxonMobil" should match both "exxon mobil" and "exxonmobil".

I can't seem to find the right combination of analyzer/filters to accomplish this.

Asked by David Pfeffer, Nov 18 '13



1 Answer

I think the most direct approach to this problem is to apply a shingle token filter, which, instead of creating ngrams of characters, creates combinations of incoming tokens. You can add it to your analyzer like so:

filter:
    ........
    my_shingle_filter:
        type: shingle
        min_shingle_size: 2
        max_shingle_size: 3
        output_unigrams: true
        token_separator: ""
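
For context, here is a sketch of how that filter might be wired into a complete custom analyzer in the index settings. The analyzer and filter names are made up for illustration, and the standard/lowercase/stop components are just one plausible chain before the shingle step:

```yaml
settings:
    analysis:
        filter:
            my_shingle_filter:
                type: shingle
                min_shingle_size: 2
                max_shingle_size: 3
                output_unigrams: true
                token_separator: ""
        analyzer:
            my_shingle_analyzer:
                type: custom
                tokenizer: standard
                filter:
                    - lowercase
                    - stop
                    - my_shingle_filter
```

With output_unigrams enabled, the original tokens are emitted alongside the shingles, so "The Home Depot" indexes as home, depot, and homedepot, covering all three query forms from the question.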

You should be mindful of where this filter is placed in your filter chain. It should probably come late in the chain, after all token separation/removal/replacement has already occurred (i.e. after any stop filters, synonym filters, stemmers, etc.).
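To see why this works, here is a rough Python sketch of what a shingle filter with an empty token_separator produces. This is an illustration only, not Elasticsearch code; the function name and signature are invented for the example:

```python
def shingles(tokens, min_size=2, max_size=3, output_unigrams=True, separator=""):
    """Emit the original tokens (optionally) plus joined runs of
    min_size..max_size adjacent tokens, mimicking a shingle filter."""
    out = list(tokens) if output_unigrams else []
    for size in range(min_size, max_size + 1):
        for i in range(len(tokens) - size + 1):
            out.append(separator.join(tokens[i:i + size]))
    return out

# Tokens after lowercasing/stop-word removal of "The Home Depot":
print(shingles(["home", "depot"]))   # ['home', 'depot', 'homedepot']
print(shingles(["exxon", "mobil"]))  # ['exxon', 'mobil', 'exxonmobil']
```

Because "homedepot" and "exxonmobil" end up in the index as single terms, a query for the space-free form matches, while the unigrams still match "home" and "home depot".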

Answered by femtoRgon, Oct 16 '22