 

Full Text Search like Google

I would like to implement full-text-search in my off-line (android) application to search the user generated list of notes.

I would like it to behave just like Google (since most people are already used to querying to Google)

My initial requirements are:

  • Fast: like Google, or as fast as possible, with 100,000 documents of about 200 words each.
  • Searching for two words should only return documents that contain both words (not just one word) (unless the OR operator is used)
  • Case insensitive (aka: normalization): If I have the word 'Hello' and I search for 'hello' it should match.
  • Diacritical mark insensitive: If I have the word 'así', a search for 'asi' should match. In Spanish, many people either omit diacritical marks or place them incorrectly.
  • Stop word elimination: To not have a huge index meaningless words like 'and', 'the' or 'for' should not be indexed at all.
  • Dictionary substitution (aka: stem words): Similar words should be indexed as one. For example, instances of 'hungrily' and 'hungry' should be replaced with 'hunger'.
  • Phrase search: If I have the text 'Hello world!' a search of '"world hello"' should not match it but a search of '"hello world"' should match.
  • Search all fields (in multifield documents) if no field specified (not just a default field)
  • Auto-completion in search results while typing to give popular searches. (just like Google Suggest)

How may I configure a full-text-search engine to behave as much as possible as Google?

(I am mostly interested in Open Source, Java and in particular Lucene)

Eduardo asked Dec 30 '09


1 Answer

I think Lucene can address your requirements. You should also consider using Solr, which has similar functionality and is much easier to set up.

I will discuss each requirement separately, using Lucene. I believe Solr has similar mechanisms.

  • Fast: like Google, or as fast as possible, with 100,000 documents of about 200 words each.

This is a reasonable index size for both Lucene and Solr, allowing retrieval within several tens of milliseconds per query.

  • Searching for two words should only return documents that contain both words (not just one word) (unless the OR operator is used)

You can do that in Lucene with a BooleanQuery whose clauses default to MUST.
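As a sketch (recent Lucene API; the field name "contents" is just an example, not something from the question):

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class AndQueryExample {
    // Each term is a MUST clause, so only documents containing
    // both "hello" and "world" will match.
    public static Query bothTerms() {
        return new BooleanQuery.Builder()
                .add(new TermQuery(new Term("contents", "hello")), BooleanClause.Occur.MUST)
                .add(new TermQuery(new Term("contents", "world")), BooleanClause.Occur.MUST)
                .build();
    }
}
```

For queries typed by users, the classic QueryParser can be given the same behavior with setDefaultOperator(QueryParser.Operator.AND), so two words are ANDed unless the user writes OR explicitly.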

The next four requirements can be handled by customizing a Lucene Analyzer:

  • Case insensitive (aka: normalization): If I have the word 'Hello' and I search for 'hello' it should match.

A LowerCaseFilter can be used for this.
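A minimal custom Analyzer wiring a LowerCaseFilter after the tokenizer might look like this (package locations for these filters vary between Lucene versions):

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.standard.StandardTokenizer;

// Lower-cases every token, so 'Hello' and 'hello' become the same index term.
public class LowercasingAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new StandardTokenizer();
        TokenStream result = new LowerCaseFilter(source);
        return new TokenStreamComponents(source, result);
    }
}
```

The same analyzer must be used at both index and query time, or the lower-cased terms in the index will not line up with the query terms.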

  • Diacritical mark insensitive: If I have the word 'así', a search for 'asi' should match. In Spanish, many people either omit diacritical marks or place them incorrectly.

This requires Unicode normalization followed by diacritic removal. You can build a custom Analyzer for this.
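Plain JDK code shows the idea: decompose to NFD so each accent becomes a separate combining mark, then delete the marks. (Lucene also ships an ASCIIFoldingFilter that can be dropped into an analyzer chain for the same effect.)

```java
import java.text.Normalizer;

public class DiacriticStripper {
    // NFD decomposition turns 'í' into 'i' plus a combining acute accent;
    // the \p{M} character class then matches and removes all combining marks.
    public static String strip(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}", "");
    }

    public static void main(String[] args) {
        System.out.println(strip("así"));     // asi
        System.out.println(strip("canción")); // cancion
    }
}
```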

  • Stop word elimination: To not have a huge index meaningless words like 'and', 'the' or 'for' should not be indexed at all.

A StopFilter removes stop words in Lucene.
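A StopFilter slots into the same kind of analyzer chain, usually after lower-casing so tokens match the stop list. The stop word set used here (EnglishAnalyzer's default list) is just one possible choice:

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.standard.StandardTokenizer;

// Drops common words ('and', 'the', 'for', ...) so they are never indexed.
public class StopWordAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new StandardTokenizer();
        TokenStream result = new LowerCaseFilter(source);
        result = new StopFilter(result, EnglishAnalyzer.ENGLISH_STOP_WORDS_SET);
        return new TokenStreamComponents(source, result);
    }
}
```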

  • Dictionary substitution (aka: stem words): Similar words should be indexed as one. For example, instances of 'hungrily' and 'hungry' should be replaced with 'hunger'.

Lucene has many Snowball Stemmers. One of them may be appropriate.
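A SnowballFilter can be appended to the analyzer chain; an EnglishStemmer is assumed here, and Lucene bundles stemmers for Spanish and many other languages. Note that Snowball stems are algorithmic, not dictionary words: related forms collapse to a common stem (e.g. 'hungry' may become 'hungri'), which is fine because the same stemming is applied to queries.

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.snowball.SnowballFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.tartarus.snowball.ext.EnglishStemmer;

// Reduces tokens to algorithmic stems so related word forms share one index term.
public class StemmingAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new StandardTokenizer();
        TokenStream result = new LowerCaseFilter(source);
        result = new SnowballFilter(result, new EnglishStemmer());
        return new TokenStreamComponents(source, result);
    }
}
```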

  • Phrase search: If I have the text 'Hello world!' a search of '"world hello"' should not match it but a search of '"hello world"' should match.

This is covered by Lucene's specialized PhraseQuery.
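In recent Lucene versions there is a varargs constructor for exact phrases (older versions add Terms one at a time); the field name "contents" is again just an example:

```java
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.Query;

public class PhraseQueryExample {
    // Matches documents where "hello" is immediately followed by "world";
    // the reversed order "world hello" will not match.
    public static Query helloWorldPhrase() {
        return new PhraseQuery("contents", "hello", "world");
    }
}
```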

As you can see, Lucene covers all of the required functionality. To get a more general picture, I suggest the book Lucene in Action, The Apache Lucene Wiki or The Lucid Imagination Site.

Yuval F answered Sep 29 '22