 

Matching with missing spaces in ElasticSearch

I have documents that I want to index in ElasticSearch that contain a text field called name. I currently index the name using the snowball analyzer. However, I would like to match names both with and without included spaces. For example, a document with the name "The Home Depot" should match "homedepot", "home", and "home depot". Likewise, a document with a single-word name like "ExxonMobil" should match both "exxon mobil" and "exxonmobil".

I can't seem to find the right combination of analyzer/filters to accomplish this.

Asked by David Pfeffer, Nov 18 '13



1 Answer

I think the most direct approach to this problem is to apply a shingle token filter, which, instead of creating ngrams of characters, creates combinations of incoming tokens. You can add it to your analyzer like so:

filter:
    ........
    my_shingle_filter:
        type: shingle
        min_shingle_size: 2
        max_shingle_size: 3
        output_unigrams: true
        token_separator: ""
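
For context, here is a sketch of how that filter might be wired into a complete custom analyzer in the index settings. The analyzer and filter names are made up for illustration, and the standard/lowercase/stop components are just one plausible chain before the shingle step:

```yaml
settings:
    analysis:
        filter:
            my_shingle_filter:
                type: shingle
                min_shingle_size: 2
                max_shingle_size: 3
                output_unigrams: true
                token_separator: ""
        analyzer:
            my_shingle_analyzer:
                type: custom
                tokenizer: standard
                filter:
                    - lowercase
                    - stop
                    - my_shingle_filter
```

With output_unigrams enabled, the original tokens are emitted alongside the shingles, so "The Home Depot" indexes as home, depot, and homedepot, covering all three query forms from the question.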

You should be mindful of where this filter is placed in your filter chain. It should probably come late in the chain, after all token separation/removal/replacement has already occurred (i.e. after any stop filters, synonym filters, stemmers, etc.).
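To see why this works, here is a rough Python sketch of what a shingle filter with an empty token_separator produces. This is an illustration only, not Elasticsearch code; the function name and signature are invented for the example:

```python
def shingles(tokens, min_size=2, max_size=3, output_unigrams=True, separator=""):
    """Emit the original tokens (optionally) plus joined runs of
    min_size..max_size adjacent tokens, mimicking a shingle filter."""
    out = list(tokens) if output_unigrams else []
    for size in range(min_size, max_size + 1):
        for i in range(len(tokens) - size + 1):
            out.append(separator.join(tokens[i:i + size]))
    return out

# Tokens after lowercasing/stop-word removal of "The Home Depot":
print(shingles(["home", "depot"]))   # ['home', 'depot', 'homedepot']
print(shingles(["exxon", "mobil"]))  # ['exxon', 'mobil', 'exxonmobil']
```

Because "homedepot" and "exxonmobil" end up in the index as single terms, a query for the space-free form matches, while the unigrams still match "home" and "home depot".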

Answered by femtoRgon, Oct 16 '22