Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Exact phrase search using Lucene.net

I am having trouble searching for an exact phrase using Lucene.NET 2.0.0.4

For example I am searching for "scope attribute sets the variable" (including quotes) but receive no matches, I have confirmed 100% that the phrase exists.

Can anyone suggest where I am going wrong? Is this even supported with Lucene.NET? As usual the API documentation is not too helpful and a few CodeProject articles I've read don't specifically touch on this.

Using the following code to create the index:

Directory dir = Lucene.Net.Store.FSDirectory.GetDirectory("Index", true);

Analyzer analyzer = new Lucene.Net.Analysis.SimpleAnalyzer();

IndexWriter indexWriter = new Lucene.Net.Index.IndexWriter(dir, analyzer,true);

//create a document, add in a single field
Lucene.Net.Documents.Document doc = new Lucene.Net.Documents.Document();

Lucene.Net.Documents.Field fldContent = new Lucene.Net.Documents.Field(
    "content", File.ReadAllText(@"Documents\100.txt"),
    Lucene.Net.Documents.Field.Store.YES,
    Lucene.Net.Documents.Field.Index.TOKENIZED);

doc.Add(fldContent);

//write the document to the index
indexWriter.AddDocument(doc);

I then search for a phrase using:

//state the file location of the index
Directory dir = Lucene.Net.Store.FSDirectory.GetDirectory("Index", false);

//create an index searcher that will perform the search
IndexSearcher searcher = new Lucene.Net.Search.IndexSearcher(dir);

QueryParser qp = new QueryParser("content", new SimpleAnalyzer());

// txtSearch.Text  Contains a phrase such as "this is a phrase" 
Query q=qp.Parse(txtSearch.Text);  


//execute the query
Lucene.Net.Search.Hits hits = searcher.Search(q);

The target document is about 7 MB plain text.

I have seen this previous question however I don't want a proximity search, just an exact phrase search.

like image 307
Ash Avatar asked May 12 '09 02:05

Ash


People also ask

How do you search in Lucene?

Lucene supports single and multiple character wildcard searches within single terms (not within phrase queries). To perform a single character wildcard search use the "?" symbol. To perform a multiple character wildcard search use the "*" symbol. You can also use the wildcard searches in the middle of a term.

What is Lucene full-text search?

Apache Lucene™ is a high-performance, full-featured search engine library written entirely in Java. It is a technology suitable for nearly any application that requires structured search, full-text search, faceting, nearest-neighbor search across high-dimensionality vectors, spell correction or query suggestions.

Why Lucene is so fast?

Why is Lucene faster? Lucene is very fast at searching for data because of its inverted index technique. Normally, datasources structure the data as an object or record, which in turn have fields and values.


2 Answers

Shashikant Kore is correct with his answer, you need to enable term positions...

However, I would recommend not storing the text of the document in the field unless you absolutely need it to return back to you in the search results... Setting the store to 'NO' might help reduce the size of your index a bit.

Lucene.Net.Documents.Field fldContent = 
    new Lucene.Net.Documents.Field("content", 
        File.ReadAllText(@"Documents\100.txt"),
    Lucene.Net.Documents.Field.Store.NO,
    Lucene.Net.Documents.Field.Index.TOKENIZED, 
    Lucene.Net.Documents.Field.TermVector.WITH_POSITIONS_OFFSETS);
like image 196
josefresno Avatar answered Oct 27 '22 15:10

josefresno


You have not enabled the term positions. Creating field as follows should solve your problem.

Lucene.Net.Documents.Field fldContent = 
    new Lucene.Net.Documents.Field("content", 
        File.ReadAllText(@"Documents\100.txt"),
    Lucene.Net.Documents.Field.Store.YES,
    Lucene.Net.Documents.Field.Index.TOKENIZED, 
    Lucene.Net.Documents.Field.TermVector.WITH_POSITIONS_OFFSETS);
like image 29
Shashikant Kore Avatar answered Oct 27 '22 13:10

Shashikant Kore