
Storing words with apostrophe in Lucene index

I have a company field in a Lucene index. One of the indexed company names is: Moody's

When a user types any of the following keywords, I want this company to come up in the search results:

  1. Moo
  2. Mood
  3. Moodys
  4. Moody's

How should I index this field in Lucene, and what type of Lucene query should I use to get this behaviour?

Thanks.

Jimmy asked Jul 27 '09 21:07



2 Answers

Based on your clarifications, I want to divide your question into two, and answer each in turn:

  1. How do I index words with apostrophes as equivalent to similar words without an apostrophe? e.g. mapping Moodys and Moody's to the same index term.
  2. How do I implement auto-complete search in Lucene, i.e. given an index, find documents using word prefixes, e.g. mapping Moo to Moodys?

1 is relatively easy: use a StandardTokenizer to create a token combining the apostrophe and s with the previous word, then a StandardFilter to remove the apostrophe and s. This will convert Moody's to Moody. A StandardAnalyzer does this and much more (lowercasing and stop word removal), which may be more than you need. Using a stemmer should take both Moodys and Moody to the same token. Try SnowballFilter for this.
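
As a rough illustration of the normalisation described in point 1, the following plain-Java sketch (no Lucene dependency) lowercases a token, strips a trailing possessive, and then drops a trailing plural "s". The class and method names are hypothetical, and the crude plural rule is only a stand-in for a real stemmer such as the Snowball one; it is not how Lucene's filters are implemented.

```java
import java.util.Locale;

// Hypothetical helper, not part of Lucene: mimics the effect of
// possessive stripping followed by (very crude) stemming.
final class CompanyTermNormalizer {

    // Remove a trailing 's (straight or curly apostrophe), e.g. "moody's" -> "moody".
    static String stripPossessive(String token) {
        if (token.endsWith("'s") || token.endsWith("\u2019s")) {
            return token.substring(0, token.length() - 2);
        }
        return token;
    }

    // Stand-in for a stemmer: drop one trailing "s" from longer words,
    // leaving words that end in "ss" (e.g. "glass") untouched.
    static String crudeStem(String token) {
        if (token.length() > 3 && token.endsWith("s") && !token.endsWith("ss")) {
            return token.substring(0, token.length() - 1);
        }
        return token;
    }

    // Lowercase first so that "MOODY'S" and "Moody's" are handled alike.
    static String normalize(String rawToken) {
        String lower = rawToken.toLowerCase(Locale.ROOT);
        return crudeStem(stripPossessive(lower));
    }
}
```

With this sketch, normalize("Moody's"), normalize("Moodys") and normalize("Moody") all return "moody", so the three spellings would share one index term as long as the same normalisation is applied at both index and query time.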

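2 is harder: Lucene's PrefixQuery, to which Alan alluded, matches against the start of an indexed term, so if the company field is indexed as a single untokenized term it will only work when the typed word begins the field value. You need something like the answer to this question about auto-complete in Lucene.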
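
One common way to get this kind of auto-complete is to index every leading prefix of every word (edge n-grams), so that a typed prefix becomes an exact term lookup. The linked answer is not reproduced here, so treat this as one possible approach rather than what it recommends. The sketch below shows the idea in plain Java with an in-memory map instead of a Lucene index; the class name, the minimum prefix length of 2, and the simple normalisation (lowercasing and deleting apostrophes) are all illustrative assumptions.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical in-memory stand-in for an index that stores edge n-grams.
final class EdgePrefixIndex {
    private static final int MIN_PREFIX = 2; // illustrative choice

    // prefix -> names of companies containing a word with that prefix
    private final Map<String, Set<String>> postings = new HashMap<>();

    // Lowercase and delete apostrophes so "Moody's" and "Moodys" coincide.
    private static String normalize(String text) {
        return text.toLowerCase(Locale.ROOT).replace("'", "").replace("\u2019", "");
    }

    // Index every prefix (of length >= MIN_PREFIX) of every word in the name.
    void add(String companyName) {
        for (String word : normalize(companyName).split("\\s+")) {
            for (int len = MIN_PREFIX; len <= word.length(); len++) {
                postings.computeIfAbsent(word.substring(0, len), k -> new TreeSet<>())
                        .add(companyName);
            }
        }
    }

    // A typed keyword is normalised the same way and looked up as an exact term.
    Set<String> search(String typed) {
        return postings.getOrDefault(normalize(typed.trim()), Collections.emptySet());
    }
}
```

After add("Moody's"), searching for Moo, Mood, Moodys or Moody's each returns that company, and a word in the middle of a name (poor in "Standard & Poor's") matches too. The trade-off is a larger index, since each word contributes one entry per prefix.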

Yuval F answered Nov 05 '22 15:11



The StandardAnalyzer should work for 3 and 4, but it won't work for 1 and 2.

Rather than writing your own (complex) text analyser, I would think about how you expect company names to be searched for. For example, basic Lucene search syntax lets you find "Moody's" with wildcard searches such as "Moo*" and "Mood*". You might therefore consider appending a "*" to the search term before submitting it to Lucene, although this could cause some confusion if the user isn't aware of the wildcard being added under the hood.
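
A minimal sketch of that wildcard step, in plain Java: the helper below (a hypothetical name, not a Lucene API) appends "*" to the user's input unless it is blank or already ends in a wildcard. It assumes a single-word search term and does no escaping of other query-syntax characters, which real user input would need.

```java
// Hypothetical helper, not part of Lucene: turns user input into a
// trailing-wildcard query string such as "Moo*".
final class WildcardSuffix {

    static String withTrailingWildcard(String userInput) {
        String term = userInput == null ? "" : userInput.trim();
        // Leave blank input alone rather than producing a bare "*".
        if (term.isEmpty()) {
            return term;
        }
        // Do not double up if the user already typed a wildcard.
        if (term.endsWith("*") || term.endsWith("?")) {
            return term;
        }
        return term + "*";
    }
}
```

Here withTrailingWildcard("Moo") yields "Moo*", which would then be passed to the query parser in place of the raw input.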

Alan answered Nov 05 '22 15:11
