
Search with various combinations of space, hyphen, casing and punctuation

My schema:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="stopwords.txt"
            enablePositionIncrements="true"
            />
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            catenateWords="1" catenateNumbers="1" catenateAll="0"
            splitOnCaseChange="1" splitOnNumerics="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"
            protected="protwords.txt"/>
  </analyzer>
</fieldType>

Combinations that I want to work:

"Walmart", "WalMart", "Wal Mart", "Wal-Mart", "Wal-mart"

Given any of these strings, I want to find all of the others.

So, there are 25 such combinations as given below:

(First column denotes input text for search, second column denotes expected match)

(Walmart,Walmart)
(Walmart,WalMart)
(Walmart,Wal Mart)
(Walmart,Wal-Mart)
(Walmart,Wal-mart)
(WalMart,Walmart)
(WalMart,WalMart)
(WalMart,Wal Mart)
(WalMart,Wal-Mart)
(WalMart,Wal-mart)
(Wal Mart,Walmart)
(Wal Mart,WalMart)
(Wal Mart,Wal Mart)
(Wal Mart,Wal-Mart)
(Wal Mart,Wal-mart)
(Wal-Mart,Walmart)
(Wal-Mart,WalMart)
(Wal-Mart,Wal Mart)
(Wal-Mart,Wal-Mart)
(Wal-Mart,Wal-mart)
(Wal-mart,Walmart)
(Wal-mart,WalMart)
(Wal-mart,Wal Mart)
(Wal-mart,Wal-Mart)
(Wal-mart,Wal-mart)

Current limitations with my schema:

1. "Wal-Mart" -> "Walmart",
2. "Wal Mart" -> "Walmart",
3. "Walmart"  -> "Wal Mart",
4. "Wal-mart" -> "Walmart",
5. "WalMart"  -> "Walmart"

Screenshot of the analyzer:

[Image: Analyzer screenshot using initial schema]

I tried various combinations of filters to resolve these limitations, and I stumbled upon the solution provided at: Solr - case-insensitive search do not work

While it seems to overcome one of the limitations that I have (see #5 WalMart -> Walmart), it is overall worse than what I had earlier. Now it does not work for cases like:

(Wal Mart,WalMart), 
(Wal-Mart,WalMart), 
(Wal-mart,WalMart), 
(WalMart,Wal Mart)
in addition to cases 1 to 4 listed above.

[Image: Analyzer after schema change]

Questions:

  1. Why does "WalMart" not match "Walmart" with my initial schema? The Solr analyzer clearly shows that it produced 3 tokens at index time: wal, mart, walmart. At query time it produced 1 token: walmart (although it is not clear why it would produce just 1 token). I fail to understand why it does not match, given that walmart is contained in both the query and the index tokens.

  2. The problem that I mentioned here is just a single use case. There are slightly more complex ones, like:

    Words with apostrophes: "Mc Donalds", "Mc Donald's", "McDonald's", "Mc donalds", "Mc donald's", "Mcdonald's"

    Words with different punctuations: "Mc-Donald Engineering Company, Inc."

In general, what's the best way to go around modeling the schema with this kind of requirement? NGrams? Indexing the same data in different fields (in different formats) with the copyField directive (https://wiki.apache.org/solr/SchemaXml#Indexing_same_data_in_multiple_fields)? What are the performance implications of this?
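For illustration, the copyField option mentioned above could be sketched roughly as follows. The field names (company, company_joined) and the type name text_joined are assumptions, not part of the original schema. The joined type is modeled on the alphaOnlySort type from Solr's example schema: it keeps the whole value as one token, lowercases it and strips everything except letters and digits, so it only suits matching the complete value.

```xml
<!-- schema.xml sketch; field and type names are illustrative assumptions -->
<fieldType name="text_joined" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Keep the whole input as a single token -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Drop spaces, hyphens, apostrophes and other punctuation -->
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="([^a-z0-9])" replacement="" replace="all"/>
  </analyzer>
</fieldType>

<field name="company" type="text" indexed="true" stored="true"/>
<field name="company_joined" type="text_joined" indexed="true" stored="false"/>

<!-- Index the same source value a second time in the joined form -->
<copyField source="company" dest="company_joined"/>
```

Under these assumptions, "Wal-Mart", "Wal Mart" and "WalMart" should all be indexed as walmart in company_joined. Depending on the query parser and Solr version, an unquoted query containing a space may be split on whitespace before analysis, so a query such as Wal Mart may need to be quoted for this field. The extra field also increases index size, so the Analysis screen and a test index are the places to confirm both behavior and cost.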

EDIT: The default operator in my Solr schema is AND. I cannot change it to OR.

Sudheer Aedama asked Apr 21 '15 21:04


2 Answers

We considered hyphenated words a special case and wrote a custom analyzer that was used at index time to create three versions of such a token, so in your case wal-mart would become walmart, wal mart and wal-mart. Each of these synonyms was written out using a custom SynonymFilter that was initially adapted from an example in the Lucene in Action book. The SynonymFilter sat between the whitespace tokenizer and the lowercase filter.

At search time, any of the three versions would match one of the synonyms in the index.
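The wiring described above might look roughly like the following sketch. The class com.example.HyphenVariantFilterFactory is a hypothetical placeholder for the custom filter (it is not part of Solr and would have to be written and deployed separately), and the field type name is an assumption as well.

```xml
<!-- schema.xml sketch; the custom filter class below is hypothetical -->
<fieldType name="text_hyphen_variants" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Hypothetical: for wal-mart, emit walmart, wal mart and wal-mart -->
    <filter class="com.example.HyphenVariantFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <!-- No expansion at query time: the variants already exist in the index -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Expanding only at index time keeps the query token count unchanged, at the cost of a somewhat larger index.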

Sujit Pal answered Nov 10 '22 16:11


Why does "WalMart" not match "Walmart" with my initial schema?

Because you have set the mm parameter of your DisMax/eDisMax handler too high. I have played around with it: when you set mm to 100%, you get no match. But why?

Because you are using the same analyzer at query and index time. Your search term "WalMart" is split into 3 tokens (words), namely "wal", "mart" and "walmart". Solr then treats each word individually when counting towards the <str name="mm">100%</str>*.

By the way, I have reproduced your problem, but for me it occurs when indexing Walmart and querying with WalMart. The other way around works fine.

You can override this with LocalParams by rephrasing your query like this: {!mm=1}WalMart.
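The point about sharing one analyzer between index and query time can be made concrete by declaring the two separately. The following is only a sketch, not a tested fix for all 25 combinations: the index-time chain is copied from the question, while the query-time WordDelimiterFilterFactory settings (generateWordParts="0" with catenateWords="1") are an assumption chosen so that a query term is joined rather than split.

```xml
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <!-- Index time: same chain as in the question -->
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            catenateWords="1" catenateNumbers="1" catenateAll="0"
            splitOnCaseChange="1" splitOnNumerics="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"
            protected="protwords.txt"/>
  </analyzer>
  <!-- Query time (assumption): emit only the joined form, not the parts -->
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="0" generateNumberParts="0"
            catenateWords="1" catenateNumbers="1" catenateAll="0"
            splitOnCaseChange="1" splitOnNumerics="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"
            protected="protwords.txt"/>
  </analyzer>
</fieldType>
```

With this split, a query such as WalMart or Wal-Mart would be expected to reduce to the single token walmart, which keeps the clause count at one. It does not cover values indexed with a space ("Wal Mart" is still indexed as wal and mart only), so the Analysis screen should be used to check each pair.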

There are more slightly complex ones like [ ... ] "Mc Donald's" [ to match ] Words with different punctuations: "Mc-Donald Engineering Company, Inc."

Here, too, playing with the mm parameter helps.

In general, what's the best way to go around modeling the schema with this kind of requirement?

Here I agree with Sujit Pal: you should implement your own copy of the SynonymFilter. Why? Because it works differently from the other filters and tokenizers. It creates tokens in place, at the offset of the indexed words.

What does "in place" mean? It will not increase the token count of your query. And you can perform the back-hyphenation (joining two words that are separated by a blank).

But we are lacking a good synonyms.txt and cannot keep it up-to-date.

When extending or copying the SynonymFilter, ignore the static mapping. You may remove the code that maps the words; you just need the offset handling.

Update: I think you can also try the PatternCaptureGroupTokenFilter, but tackling company names with regular expressions may soon reach its limits. I will have a look into this later.
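As a rough idea of what trying the pattern capture group filter could look like in a schema, here is a minimal sketch. The field type name and the regular expression are assumptions; the pattern and preserve_original attributes are those of solr.PatternCaptureGroupFilterFactory.

```xml
<!-- schema.xml sketch; type name and pattern are illustrative assumptions -->
<fieldType name="text_capture" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Emit each run of letters as an extra token and keep the original -->
    <filter class="solr.PatternCaptureGroupFilterFactory"
            pattern="([A-Za-z]+)" preserve_original="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

For "Wal-Mart" this should yield the original token plus wal and mart, but not the joined form walmart, which illustrates the limits mentioned above.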


* You can find this in your solrconfig.xml, have a look for your <requestHandler ... />
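As orientation for this footnote, an mm default in solrconfig.xml typically sits in the defaults list of the request handler. The handler name and the defType and qf values below are assumptions, not taken from the question's configuration.

```xml
<!-- solrconfig.xml sketch; handler name, defType and qf are assumptions -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <!-- Hypothetical field to search in -->
    <str name="qf">company</str>
    <!-- Minimum should match: 100% requires every optional clause to match -->
    <str name="mm">100%</str>
  </lst>
</requestHandler>
```

A value given in defaults can still be overridden per request, for example with the {!mm=1} LocalParams form shown in the answer.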

cheffe answered Nov 10 '22 14:11