I am having trouble searching for an exact phrase using Lucene.NET 2.0.0.4.
For example, I am searching for "scope attribute sets the variable" (including quotes) but receive no matches, even though I have confirmed 100% that the phrase exists.
Can anyone suggest where I am going wrong? Is this even supported with Lucene.NET? As usual the API documentation is not too helpful and a few CodeProject articles I've read don't specifically touch on this.
I am using the following code to create the index:
Directory dir = Lucene.Net.Store.FSDirectory.GetDirectory("Index", true);
Analyzer analyzer = new Lucene.Net.Analysis.SimpleAnalyzer();
IndexWriter indexWriter = new Lucene.Net.Index.IndexWriter(dir, analyzer, true);
//create a document, add in a single field
Lucene.Net.Documents.Document doc = new Lucene.Net.Documents.Document();
Lucene.Net.Documents.Field fldContent = new Lucene.Net.Documents.Field(
"content", File.ReadAllText(@"Documents\100.txt"),
Lucene.Net.Documents.Field.Store.YES,
Lucene.Net.Documents.Field.Index.TOKENIZED);
doc.Add(fldContent);
//write the document to the index
indexWriter.AddDocument(doc);
I then search for a phrase using:
//state the file location of the index
Directory dir = Lucene.Net.Store.FSDirectory.GetDirectory("Index", false);
//create an index searcher that will perform the search
IndexSearcher searcher = new Lucene.Net.Search.IndexSearcher(dir);
QueryParser qp = new QueryParser("content", new SimpleAnalyzer());
// txtSearch.Text contains a phrase such as "this is a phrase"
Query q = qp.Parse(txtSearch.Text);
//execute the query
Lucene.Net.Search.Hits hits = searcher.Search(q);
The target document is about 7 MB plain text.
I have seen this previous question; however, I don't want a proximity search, just an exact phrase search.
Shashikant Kore is correct in his answer: you need to enable term positions.
However, I would recommend not storing the text of the document in the field unless you absolutely need it returned to you in the search results. Setting the store to 'NO' might help reduce the size of your index a bit.
Lucene.Net.Documents.Field fldContent =
new Lucene.Net.Documents.Field("content",
File.ReadAllText(@"Documents\100.txt"),
Lucene.Net.Documents.Field.Store.NO,
Lucene.Net.Documents.Field.Index.TOKENIZED,
Lucene.Net.Documents.Field.TermVector.WITH_POSITIONS_OFFSETS);
You have not enabled the term positions. Creating the field as follows should solve your problem.
Lucene.Net.Documents.Field fldContent =
new Lucene.Net.Documents.Field("content",
File.ReadAllText(@"Documents\100.txt"),
Lucene.Net.Documents.Field.Store.YES,
Lucene.Net.Documents.Field.Index.TOKENIZED,
Lucene.Net.Documents.Field.TermVector.WITH_POSITIONS_OFFSETS);
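To illustrate why positions matter for an exact phrase (as opposed to a plain term match), here is a minimal sketch of a positional inverted index. It is a toy model and not Lucene.NET's actual implementation: the class PositionalIndexSketch and its methods Tokenize, BuildIndex and ContainsPhrase are hypothetical names made up for this example, and the tokenizer only roughly imitates what SimpleAnalyzer does (lower-casing and splitting on non-letters).

```csharp
using System.Collections.Generic;
using System.Text;

// Hypothetical helper, NOT part of Lucene.NET: a toy positional inverted index.
public static class PositionalIndexSketch
{
    // Lower-case the text and split on non-letters (an approximation of SimpleAnalyzer).
    public static List<string> Tokenize(string text)
    {
        List<string> tokens = new List<string>();
        StringBuilder current = new StringBuilder();
        foreach (char c in text)
        {
            if (char.IsLetter(c))
            {
                current.Append(char.ToLowerInvariant(c));
            }
            else if (current.Length > 0)
            {
                tokens.Add(current.ToString());
                current.Length = 0; // reset the buffer for the next token
            }
        }
        if (current.Length > 0) tokens.Add(current.ToString());
        return tokens;
    }

    // Map each term to the list of token positions at which it occurs.
    public static Dictionary<string, List<int>> BuildIndex(string text)
    {
        Dictionary<string, List<int>> index = new Dictionary<string, List<int>>();
        List<string> tokens = Tokenize(text);
        for (int pos = 0; pos < tokens.Count; pos++)
        {
            List<int> positions;
            if (!index.TryGetValue(tokens[pos], out positions))
            {
                positions = new List<int>();
                index[tokens[pos]] = positions;
            }
            positions.Add(pos);
        }
        return index;
    }

    // True when the phrase's terms occur at consecutive positions (no slop).
    public static bool ContainsPhrase(Dictionary<string, List<int>> index, string phrase)
    {
        List<string> terms = Tokenize(phrase);
        if (terms.Count == 0) return false;

        List<int> starts;
        if (!index.TryGetValue(terms[0], out starts)) return false;

        foreach (int start in starts)
        {
            bool match = true;
            for (int i = 1; i < terms.Count && match; i++)
            {
                List<int> positions;
                // The i-th term must sit exactly i positions after the first term.
                match = index.TryGetValue(terms[i], out positions)
                        && positions.Contains(start + i);
            }
            if (match) return true;
        }
        return false;
    }
}
```

With an index built from "The scope attribute sets the variable", ContainsPhrase finds "scope attribute sets" but not "attribute scope", even though both terms are present. Without the position lists, the two cases could not be told apart.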