I'm having trouble with a Lucene index that contains words with "-" characters. Searching works for some of these words but not for all, and I can't find the reason. The field I'm searching in is analyzed and contains versions of the word with and without the "-" character. I'm using the analyzer org.apache.lucene.analysis.standard.StandardAnalyzer.

Here is an example: if I search for "gsx-*" I get a result; the indexed field contains "SUZUKI GSX-R 1000 GSX-R1000 GSXR". But if I search for "v-*" I get no result. The indexed field of the expected result contains "SUZUKI DL 1000 V-STROM DL1000V-STROMVSTROM V STROM".

If I search for "v-strom" without "*" it works, but if I search for just "v-str", for example, I don't get the result. (There should be a result, because this is for a live search in a webshop.)

So what's the difference between the two expected results? Why does it work for "gsx-" but not for "v-"?

Lucene Index problems with "-" character

2 Answers

StandardAnalyzer will treat the hyphen as whitespace, I believe. So it turns your query "gsx-*" into "gsx*" and "v-*" into nothing, because it also eliminates single-letter tokens. What you see as the field contents in the search result is the stored value of the field, which is completely independent of the terms that were indexed for that field.

So what you want is for "v-strom" as a whole to be an indexed term. StandardAnalyzer is not suited to this kind of text. Maybe have a go with the WhitespaceAnalyzer or SimpleAnalyzer. If that still doesn't cut it, you also have the option of throwing together your own analyzer, or just starting from one of the two mentioned and composing it with further TokenFilters. A very good explanation is given in the Lucene Analysis package Javadoc.
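To make the difference concrete, here is a minimal plain-Java sketch. It is not Lucene code: the class and method names are made up for illustration, and the two split rules only approximate "hyphen acts as a separator" versus "split on whitespace only". The lowercasing is also an assumption; as far as I know WhitespaceAnalyzer does not lowercase on its own, so in Lucene you would add something like a LowerCaseFilter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Plain-Java model of two tokenization strategies; this is NOT Lucene code.
class HyphenTokenizationSketch {

    // Models an analyzer that treats every non-alphanumeric character
    // (including '-') as a separator.
    static List<String> splitOnPunctuation(String text) {
        return split(text, "[^\\p{Alnum}]+");
    }

    // Models an analyzer that splits on whitespace only, so "V-STROM"
    // survives as one term.
    static List<String> splitOnWhitespace(String text) {
        return split(text, "\\s+");
    }

    // Splits, drops empty pieces, and lowercases (lowercasing is an assumption).
    private static List<String> split(String text, String separatorRegex) {
        List<String> terms = new ArrayList<>();
        for (String piece : text.split(separatorRegex)) {
            if (!piece.isEmpty()) {
                terms.add(piece.toLowerCase(Locale.ROOT));
            }
        }
        return terms;
    }

    // A prefix query matches if any indexed term starts with the prefix.
    static boolean anyTermStartsWith(List<String> terms, String prefix) {
        String p = prefix.toLowerCase(Locale.ROOT);
        for (String term : terms) {
            if (term.startsWith(p)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String field = "SUZUKI DL 1000 V-STROM";
        System.out.println(splitOnPunctuation(field));
        System.out.println(splitOnWhitespace(field));
        System.out.println(anyTermStartsWith(splitOnPunctuation(field), "v-str"));
        System.out.println(anyTermStartsWith(splitOnWhitespace(field), "v-str"));
    }
}
```

With the hyphen treated as a separator, no single term starts with "v-str", so the prefix search finds nothing. With whitespace-only splitting, "v-strom" is a term and the prefix matches.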

BTW there's no need to enter all the variants in the index, like V-strom, V-Strom, etc. The idea is for the same analyzer to normalize all these variants to the same string both in the index and while parsing the query.
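As a rough illustration of that idea, here is a plain-Java sketch (again not Lucene API; `normalize` and `matchesPrefix` are hypothetical helpers, and the rule "lowercase, then drop hyphens and whitespace" is just one possible normalization). The point is only that the same function runs on the indexed text and on the typed query.

```java
import java.util.Locale;

// Plain-Java model of "normalize the same way at index time and query time".
class VariantNormalizationSketch {

    // Lowercase, then drop hyphens and whitespace, so that spelling
    // variants such as "V-Strom", "V STROM" and "vstrom" collapse.
    static String normalize(String text) {
        return text.toLowerCase(Locale.ROOT).replaceAll("[\\s\\-]+", "");
    }

    // Live-search style check: the normalized indexed name must start
    // with the normalized text typed so far.
    static boolean matchesPrefix(String indexedName, String typed) {
        return normalize(indexedName).startsWith(normalize(typed));
    }

    public static void main(String[] args) {
        System.out.println(normalize("V-Strom"));
        System.out.println(matchesPrefix("V-STROM", "v-str"));
        System.out.println(matchesPrefix("V-STROM", "V Str"));
        System.out.println(matchesPrefix("V-STROM", "gsx"));
    }
}
```

Because both sides go through `normalize`, only one form of the name needs to be indexed.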

answered Oct 11 '22 19:10

Marko Topolnik

ClassicAnalyzer treats '-' as a useful, non-delimiter character. As I understand it, ClassicAnalyzer handles '-' like the pre-3.1 StandardAnalyzer did, because it uses ClassicTokenizer, which treats a token containing a number with an embedded '-' as a product code, so the whole thing is tokenized as one term.

When I was at Regenstrief Institute I noticed this after upgrading Luke, as the LOINC standard medical terms (LOINC was initiated by R.I.) are identified by a number followed by a '-' and a check digit, like '1-8' or '2857-1'. My searches for LOINCs like '45963-6' failed using StandardAnalyzer in Luke 3.5.0, but succeeded with ClassicAnalyzer (because we had built the index with Lucene.NET 2.9.2).
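Here is a rough plain-Java model of that product-code rule. It is a simplification I am assuming for illustration, not the actual ClassicTokenizer grammar: a whitespace-separated chunk that contains both a hyphen and a digit is kept whole, and anything else is split at hyphens.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Simplified model of "do not split at '-' when the token contains a number".
// NOT the real ClassicTokenizer; the rule here is an approximation.
class ProductCodeRuleSketch {

    static List<String> tokenize(String text) {
        List<String> terms = new ArrayList<>();
        for (String chunk : text.trim().split("\\s+")) {
            if (chunk.isEmpty()) {
                continue;
            }
            String lower = chunk.toLowerCase(Locale.ROOT);
            boolean hasHyphen = lower.indexOf('-') >= 0;
            boolean hasDigit = lower.chars().anyMatch(Character::isDigit);
            if (hasHyphen && hasDigit) {
                // Looks like a product code: keep it as a single term.
                terms.add(lower);
            } else {
                // No digit involved: the hyphen acts as a separator.
                for (String part : lower.split("-")) {
                    if (!part.isEmpty()) {
                        terms.add(part);
                    }
                }
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("45963-6"));
        System.out.println(tokenize("GSX-R1000"));
        System.out.println(tokenize("V-STROM"));
    }
}
```

Under this model '45963-6' and 'GSX-R1000' each stay one term, while 'V-STROM' is split into two, which would be consistent with "gsx-*" finding something and "v-*" finding nothing in the question.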

answered Oct 11 '22 20:10

Mark Leighton Fisher

Tags: java, indexing, escaping, character, lucene

Zteve
