How does Lucene work with quotes and wildcards

When I search in Lucene for the Dutch word bieten, is there a difference between the following: bieten, "bieten", "*bieten*" and *bieten*, when using the DutchAnalyzer and allowing leading wildcards?

As far as I can tell from the parser syntax, the quotes are there only to handle spaces, and every word is always searched as if there were wildcards around it.

I ask because I found out that the DutchAnalyzer strips the plural from every word before it is entered into the index. In my case this means that biet is stored in the index, not bieten. When searching with bieten, "bieten" or "*bieten*", the query is also modified to biet.
But when I use *bieten*, the query doesn't change and stays plural, which doesn't give any results.
So:

  bieten   -->> biet 
 "bieten"  -->> biet
"*bieten*" -->> biet 
 *bieten*  -->> *bieten*

Why is the last search translated to a different query than the others?

Queryparser syntax: https://lucene.apache.org/core/2_9_4/queryparsersyntax.html
Screenshot Lucene: http://oi63.tinypic.com/1z5krdg.jpg

Jeroen, asked Mar 04 '16 13:03


1 Answer

Wildcard, regex and fuzzy queries are not analyzed by the query parser. That's why the last query is different.

Words are definitely not searched with wildcards around them. The query *bieten* would be intended to match things like "xxbietenxx". Finding words within a sentence does not involve wildcards, though. That's what analysis is for. It splits the text into single-word terms.
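
The role of analysis can be sketched with a toy inverted index. This is not Lucene code: `toy_stem`, `toy_analyze` and `build_index` are hypothetical names, and the stemmer only strips a trailing "en" as a crude stand-in for what the DutchAnalyzer does.

```python
import re

def toy_stem(token):
    # Hypothetical stand-in for Dutch stemming: drop a trailing "en" from longer words.
    if len(token) > 4 and token.endswith("en"):
        return token[:-2]
    return token

def toy_analyze(text):
    # Lowercase, split on anything that is not a letter, then stem each token.
    return [toy_stem(t) for t in re.findall(r"[a-z]+", text.lower())]

def build_index(docs):
    # Inverted index: term -> set of ids of the documents that contain it.
    index = {}
    for doc_id, text in enumerate(docs):
        for term in toy_analyze(text):
            index.setdefault(term, set()).add(doc_id)
    return index

docs = ["Ik eet graag rode bieten", "Een biet is een groente"]
index = build_index(docs)

# The query goes through the same analysis, so "bieten" looks up the term "biet".
query_terms = toy_analyze("bieten")
hits = index.get(query_terms[0], set())
```

In this sketch both documents are found through the term biet, even though the word sits inside a sentence and no wildcard is involved.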

To explain each of those queries:

  • bieten - Simple term query. Search for the given word.
  • "bieten" - Phrase query. Analyze and find the given multi-term phrase. In this case the phrase is one term long, and so the same as a term query.
  • "*bieten*" - Again, phrase query. Not a wildcard query in any way. You can't use wildcards in phrases. The analyzer will remove the punctuation, making this identical to the previous one.
  • *bieten* - Wildcard query. This will match "bietenxx", "xxbieten", and "xxbietenxx", but will not be analyzed, and so won't match the post-analysis term "biet".
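
The four cases above can be mimicked with a small toy parser. It is only an illustration under stated assumptions, not Lucene's query parser: `toy_parse` and `toy_search` are made-up names, the stemmer is the same crude "strip a trailing en" stand-in for the DutchAnalyzer, and wildcard matching is delegated to Python's `fnmatchcase`.

```python
import re
from fnmatch import fnmatchcase

def toy_stem(token):
    # Hypothetical stand-in for Dutch stemming: drop a trailing "en" from longer words.
    if len(token) > 4 and token.endswith("en"):
        return token[:-2]
    return token

def toy_analyze(text):
    # Lowercase, drop punctuation (including * and ?), then stem each token.
    return [toy_stem(t) for t in re.findall(r"[a-z]+", text.lower())]

def toy_parse(query):
    # Quoted text is a phrase: it is analyzed, so any * inside is just punctuation.
    if len(query) >= 2 and query[0] == '"' and query[-1] == '"':
        return ("phrase", toy_analyze(query[1:-1]))
    # An unquoted * or ? makes it a wildcard query: kept verbatim, never analyzed.
    if "*" in query or "?" in query:
        return ("wildcard", query)
    # Anything else is a plain term query and is analyzed.
    return ("term", toy_analyze(query))

def toy_search(query, indexed_terms):
    # Return the indexed terms that the query ends up matching.
    kind, parsed = toy_parse(query)
    if kind == "wildcard":
        return sorted(t for t in indexed_terms if fnmatchcase(t, parsed))
    return sorted(t for t in parsed if t in indexed_terms)

# Terms as they would sit in the index after analysis: "bieten" was stored as "biet".
indexed_terms = {"biet", "rode", "groente"}

for q in ['bieten', '"bieten"', '"*bieten*"', '*bieten*', '*biet*']:
    print(q, "->", toy_parse(q), "matches", toy_search(q, indexed_terms))
```

In this toy model the first three queries all resolve to biet, while *bieten* is compared verbatim against the stored terms and finds nothing. A wildcard written against the stemmed form, such as *biet*, does match here, which suggests why wildcard queries have to be phrased in terms of what is actually in the index.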

femtoRgon, answered Nov 24 '22 04:11