Lucene: Multi-word phrases as search terms

Tags: java, search, lucene

I'm trying to make a searchable phone/local business directory using Apache Lucene.

I have fields for street name, business name, phone number, etc. The problem is that when I search by street and the street name has multiple words (e.g. 'the crescent'), no results are returned. But if I search with just one word, e.g. 'crescent', I get all the results that I want.

I'm indexing the data with the following:

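import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.SimpleFSDirectory;
import org.apache.lucene.util.Version;

String LocationOfDirectory = "C:\\dir\\index";

StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_34);
Directory index = new SimpleFSDirectory(new File(LocationOfDirectory));

IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_34, analyzer);
IndexWriter w = new IndexWriter(index, config);

// Index one document whose "Street" field is tokenized by the analyzer.
Document doc = new Document();
doc.add(new Field("Street", "the crescent", Field.Store.YES, Field.Index.ANALYZED));

w.addDocument(doc);
w.close();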

My searches work like this:

int numberOfHits = 200;
String LocationOfDirectory = "C:\\dir\\index";
TopScoreDocCollector collector = TopScoreDocCollector.create(numberOfHits, true);
Directory directory = new SimpleFSDirectory(new File(LocationOfDirectory));
IndexSearcher searcher = new IndexSearcher(IndexReader.open(directory));

WildcardQuery q = new WildcardQuery(new Term("Street", "the crescent"));

searcher.search(q, collector);
ScoreDoc[] hits = collector.topDocs().scoreDocs;

I have tried swapping the wildcard query for a phrase query, first with the entire string and then splitting the string up on white space and wrapping them in a BooleanQuery like this:

String term = "the crescent";
BooleanQuery b = new BooleanQuery();
PhraseQuery p = new PhraseQuery();
String[] tokens = term.split(" ");
for(int i = 0 ; i < tokens.length ; ++i)
{
    p.add(new Term("Street", tokens[i]));
}
b.add(p, BooleanClause.Occur.MUST);

However, this didn't work. I tried using a KeywordAnalyzer instead of a StandardAnalyzer, but then all other types of search stopped working as well. I have tried replacing spaces with other characters (+ and @), and converting queries to and from this form, but that still doesn't work. I think it doesn't work because + and @ are special characters which are not indexed, but I can't seem to find a list anywhere of which characters are like that.

I'm beginning to go slightly mad. Does anyone know what I'm doing wrong?

asked Jan 30 '12 15:01

RikSaunderson

2 Answers

The reason you don't get your documents back is that you index with StandardAnalyzer, which converts tokens to lowercase and removes stop words. So the only term that gets indexed for your example is 'crescent'. However, wildcard queries are not analyzed, so 'the' is included as a mandatory part of the query. The same goes for phrase queries in your scenario, because you build them from the raw, unanalyzed tokens.
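To make the mismatch concrete, here is a minimal plain-Java sketch of the idea. It does not call Lucene at all: the class name AnalysisSketch, its helper methods, and the four-word stop list are assumptions made up for illustration, and the real analyzer's tokenization rules and stop list differ in detail.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy model only: this is NOT Lucene code.
class AnalysisSketch {
    // Hypothetical stop list for the demo; the real default list is longer.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "of"));

    // Roughly what index-time analysis does here:
    // lowercase, split on non-alphanumerics, drop stop words.
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                terms.add(token);
            }
        }
        return terms;
    }

    // An unanalyzed query term matches only if that exact term was indexed.
    static boolean matchesRawTerm(List<String> indexedTerms, String rawTerm) {
        return indexedTerms.contains(rawTerm);
    }

    public static void main(String[] args) {
        List<String> indexed = analyze("the crescent");
        System.out.println(indexed);                                 // [crescent]
        System.out.println(matchesRawTerm(indexed, "the crescent")); // false
        System.out.println(matchesRawTerm(indexed, "crescent"));     // true
    }
}
```

The raw string "the crescent" never appears as a term in the toy index, while "crescent" does. Running the query text through the same analysis step first would give terms that can actually match.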

KeywordAnalyzer is probably not suitable for your use case, because it treats the whole field content as a single token. You could use SimpleAnalyzer for the street field: it splits the input on all non-letter characters and then converts the tokens to lowercase. You could also consider WhitespaceAnalyzer with a LowerCaseFilter. Try the different options and work out what works best for your data and users.
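As a rough way to compare those three options, the sketch below imitates each tokenization style in plain Java. None of it uses Lucene: the class TokenizationSketch and its method names are hypothetical, the regular expressions only approximate the real tokenizers, and the sample street name is made up.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Toy approximations only; none of this calls Lucene.
class TokenizationSketch {
    // KeywordAnalyzer-like: the whole field value is one token, unchanged.
    static List<String> keywordStyle(String text) {
        return Collections.singletonList(text);
    }

    // SimpleAnalyzer-like: split on non-letters, then lowercase.
    static List<String> simpleStyle(String text) {
        return splitAndLowercase(text, "[^\\p{L}]+");
    }

    // WhitespaceAnalyzer + LowerCaseFilter-like: split on whitespace, then lowercase.
    static List<String> whitespaceLowercaseStyle(String text) {
        return splitAndLowercase(text, "\\s+");
    }

    private static List<String> splitAndLowercase(String text, String delimiterRegex) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.split(delimiterRegex)) {
            if (!token.isEmpty()) {
                tokens.add(token.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String street = "St. John's Road";
        System.out.println(keywordStyle(street));             // [St. John's Road]
        System.out.println(simpleStyle(street));              // [st, john, s, road]
        System.out.println(whitespaceLowercaseStyle(street)); // [st., john's, road]
    }
}
```

The punctuation handling is the main practical difference: splitting on non-letters breaks "John's" into two tokens, while splitting on whitespace keeps the full stop and apostrophe inside the tokens, so a query has to be tokenized the same way to match.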

Also, if changing the analyzer for that field breaks other searches, you can use a different analyzer per field (e.g. with PerFieldAnalyzerWrapper).

answered Sep 20 '22 19:09

Artur Nowak

I found that my attempt to generate a query without using a QueryParser was not working, so I stopped trying to create my own queries and used a QueryParser instead. All of the recommendations that I saw online said that you should use the same Analyzer in the QueryParser that you use during indexing, so I used a StandardAnalyzer to build the QueryParser.

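This works for this example because the QueryParser runs the query text through the same StandardAnalyzer, which removes the word "the" from the query just as it removed it from the street "the crescent" during indexing. The query therefore no longer asks for a term that isn't in the index.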

However, if we choose to search for "Grove Road", we have a problem with the out-of-the-box behaviour: the query returns all of the results containing either "Grove" OR "Road". This is easily fixed by setting up the QueryParser so that its default operator is AND instead of OR.

In the end, the correct solution was the following:

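import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopScoreDocCollector;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.SimpleFSDirectory;
import org.apache.lucene.util.Version;

int numberOfHits = 200;
String LocationOfDirectory = "C:\\dir\\index";
TopScoreDocCollector collector = TopScoreDocCollector.create(numberOfHits, true);
Directory directory = new SimpleFSDirectory(new File(LocationOfDirectory));
IndexSearcher searcher = new IndexSearcher(IndexReader.open(directory));

StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_35);

// The parser analyzes the query text with the analyzer it is given.
//WildcardQuery q = new WildcardQuery(new Term("Street", "the crescent"));
QueryParser qp = new QueryParser(Version.LUCENE_35, "Street", analyzer);
// Require every term to match instead of the default OR.
qp.setDefaultOperator(QueryParser.Operator.AND);

Query q = qp.parse("grove road"); // throws ParseException on malformed input

searcher.search(q, collector);
ScoreDoc[] hits = collector.topDocs().scoreDocs;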

answered Sep 18 '22 19:09

RikSaunderson
