Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Lucene and Special Characters

I am using Lucene.Net 2.0 to index some fields from a database table. One of the fields is a 'Name' field which allows special characters. When I perform a search, it does not find my document that contains a term with special characters.

I index my field as such:

Directory DALDirectory = FSDirectory.GetDirectory(@"C:\Indexes\Name", false);
Analyzer analyzer = new StandardAnalyzer();
IndexWriter indexWriter = new IndexWriter(DALDirectory, analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED);

Document doc = new Document();
doc.Add(new Field("Name", "Test (Test)", Field.Store.YES, Field.Index.TOKENIZED));
indexWriter.AddDocument(doc);

indexWriter.Optimize();
indexWriter.Close();

And I search doing the following:

value = value.Trim().ToLower();
value = QueryParser.Escape(value);

Query searchQuery = new TermQuery(new Term(field, value));
Searcher searcher = new IndexSearcher(DALDirectory);

TopDocCollector collector = new TopDocCollector(searcher.MaxDoc());
searcher.Search(searchQuery, collector);
ScoreDoc[] hits = collector.TopDocs().scoreDocs;

If I perform a search for field as 'Name' and value as 'Test', it finds the document. If I perform the same search as 'Name' and value as 'Test (Test)', then it does not find the document.

Even more strange, if I remove the QueryParser.Escape line do a search for a GUID (which, of course, contains hyphens) it finds documents where the GUID value matches, but performing the same search with the value as 'Test (Test)' still yields no results.

I am unsure what I am doing wrong. I am using the QueryParser.Escape method to escape the special characters and am storing the field and searching by the Lucene.Net's examples.

Any thoughts?

like image 250
Brandon Avatar asked Apr 28 '10 20:04

Brandon


People also ask

How do you use the wildcard in Lucene?

Lucene supports single and multiple character wildcard searches within single terms (not within phrase queries). To perform a single character wildcard search use the "?" symbol. To perform a multiple character wildcard search use the "*" symbol. You can also use the wildcard searches in the middle of a term.

Is Lucene case sensitive?

Lucene search is case-sensitive, but all input is usually lowercased when passing through QueryParser, so it feels like it is case insensitive (This is the case of the findBySimpleQuery() method. In other words, don't lowercase your input before indexing, and don't lowercase your queries.

Why is Lucene so fast?

Why is Lucene faster? Lucene is very fast at searching for data because of its inverted index technique. Normally, datasources structure the data as an object or record, which in turn have fields and values.


1 Answers

StandardAnalyzer strips out the special characters during indexing. You can pass in a list of explicit stopwords (excluding the ones you want in).

like image 103
Mikos Avatar answered Oct 05 '22 14:10

Mikos