
Lucene - Exact string matching

I'm trying to create a Lucene 4.10 index. I just want to store in the index the exact strings that I put into the document, without tokenization.

I'm using the StandardAnalyzer.

    Directory dir = FSDirectory.open(new File("myDire"));
    Analyzer analyzer = new StandardAnalyzer();
    IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_4_10_0, analyzer);
    iwc.setOpenMode(OpenMode.CREATE);
    IndexWriter writer = new IndexWriter(dir, iwc);
    Document doc = new Document();
    StringField field1 = new StringField("1", content1, Store.YES);
    StringField field2 = new StringField("2", content2, Store.YES);
    StringField field3 = new StringField("3", content3, Store.YES);
    doc.add(field1);
    doc.add(field2);
    doc.add(field3);
    writer.addDocument(doc, analyzer);
    writer.close();

If I print the index's content, I can see my data being stored, for example, my document has this "field 3":

    stored,indexed,tokenized,omitNorms,indexOptions=DOCS_ONLY<3:"Fuel Tank Capacity"@en>

I'm trying to query the index in order to get it back:

    IndexSearcher searcher = new IndexSearcher(reader);
    Analyzer analyzer = new StandardAnalyzer();
    QueryParser parser = new QueryParser("3", analyzer);
    String queryString = "\"\"Fuel Tank Capacity\"@en\"";
    Query query = parser.createPhraseQuery("3", QueryParser.escape(queryString));
    TopDocs docs = searcher.search(query, null, 20);

I'm trying to search for the term "Fuel Tank Capacity"@en (quotation marks included), so I escaped the quotes and wrapped the whole text in another pair of quotes so that Lucene understands I'm searching for the entire text.

If I print the query, I get 3:"fuel tank capacity en", but I don't want the text to be split on the @ symbol.

I think that my first problem is the StandardAnalyzer, because it seems to tokenize, if I'm not mistaken. However, I cannot understand how to query the index in order to get exactly "Fuel Tank Capacity"@en (quotation marks included).

Thank you

LucaT asked Sep 12 '14 13:09



1 Answer

You could simplify matters, and just cut the QueryParser out of the equation entirely. Since you are using a StringField, the whole content of the field is a single term, so a simple TermQuery should work well:

    Query query = new TermQuery(new Term("3", "\"Fuel Tank Capacity\"@en"));
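
To illustrate why a single-term lookup can succeed where the analyzed phrase query fails, here is a toy model in plain Java. It does not use Lucene at all. `roughAnalyze` is a hypothetical stand-in that only approximates what StandardAnalyzer does (lowercasing and splitting on non-alphanumeric characters), and `indexAsSingleTerm` is a made-up helper that mimics an untokenized field keeping its whole value as one term. Treat it as a sketch under those assumptions, not as a description of Lucene internals.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Toy model (NOT Lucene): whole-value term versus analyzer-style tokens. */
class ExactTermSketch {

    /** Hypothetical, simplified analyzer: lowercase, then split on non-alphanumerics. */
    static List<String> roughAnalyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    /** Hypothetical untokenized field: the entire value is the only term stored. */
    static Map<String, Integer> indexAsSingleTerm(String value, int docId) {
        Map<String, Integer> terms = new HashMap<>();
        terms.put(value, docId);
        return terms;
    }

    public static void main(String[] args) {
        String value = "\"Fuel Tank Capacity\"@en";
        Map<String, Integer> terms = indexAsSingleTerm(value, 0);

        // Tokens an analyzed query would look for.
        System.out.println(roughAnalyze(value));
        // None of those tokens exists as a term in the single-term index.
        System.out.println(terms.containsKey("fuel"));
        // The exact, unmodified value does exist.
        System.out.println(terms.containsKey(value));
    }
}
```

In this model the analyzed tokens never equal the one stored term, which is the mismatch the question describes. Looking up the unmodified string corresponds to what the TermQuery above does.
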
femtoRgon answered Sep 19 '22 05:09