I have a Solr instance that processes both indexed input and queries using StandardTokenizer (as well as ClassicFilterFactory, LowerCaseFilterFactory and StopFilterFactory).
My index contains a number of files with underscore-separated names (e.g. some_indexed_file.jpg).
I've noticed that if I query for some_indexed_file.jpg, the file I'm looking for is returned correctly.
However, if I search instead for some_indexed_file.jp* (with a trailing asterisk, which I presume acts as a wildcard), I get no results, even though by my understanding it should produce similar results.
Any idea what's going on? I assume I'm misunderstanding something about the way Solr processes queries.
Edit: as requested, here are the schema XML configuration entries:
<fieldType name="default" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory" />
    <filter class="solr.ClassicFilterFactory" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.StopFilterFactory" />
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory" />
    <filter class="solr.ClassicFilterFactory" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.StopFilterFactory" />
  </analyzer>
</fieldType>
<field name="filename" type="default" multiValued="true" omitNorms="false" termVectors="false"/>
Well, a bit more research has solved the problem. The underlying issue is that Solr doesn't apply text analysis to wildcard queries.
This meant that it was searching the index for a single term matching some_indexed_file.jp* as a literal pattern. However, when the filename was indexed, it was tokenised into "some", "indexed" and "file.jpg", none of which matches that pattern.
The non-wildcard query some_indexed_file.jpg, by contrast, was tokenised the same way as the indexed value, and therefore returned the right results.
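To make the mismatch concrete, here is a minimal toy model in Python. It is not Solr's code: `toy_analyze`, `build_index`, `search_plain` and `search_wildcard` are hypothetical helpers, and the tokenisation rule (split on underscores and whitespace, then lowercase) is an assumption chosen only to reproduce the tokens described above.

```python
import re
from fnmatch import fnmatchcase


def toy_analyze(text):
    """Assumed stand-in for the analyzer chain: split on underscores
    and whitespace, then lowercase. Not Solr's real tokenizer."""
    return [tok.lower() for tok in re.split(r"[_\s]+", text) if tok]


def build_index(filenames):
    """Map each analyzed term to the set of document ids containing it."""
    index = {}
    for doc_id, name in enumerate(filenames):
        for term in toy_analyze(name):
            index.setdefault(term, set()).add(doc_id)
    return index


def search_plain(index, query):
    """Analyzed query: the query text goes through the same analysis as
    the indexed text, and every resulting token must match."""
    tokens = toy_analyze(query)
    if not tokens:
        return set()
    result = None
    for tok in tokens:
        docs = set(index.get(tok, set()))
        result = docs if result is None else result & docs
    return result


def search_wildcard(index, pattern):
    """Wildcard query: the pattern is NOT analyzed; it is compared as-is
    against each individual indexed term."""
    hits = set()
    for term, docs in index.items():
        if fnmatchcase(term, pattern):
            hits |= docs
    return hits


index = build_index(["some_indexed_file.jpg"])
print(sorted(index))                                    # terms stored in the toy index
print(search_plain(index, "some_indexed_file.jpg"))     # analyzed query finds the document
print(search_wildcard(index, "some_indexed_file.jp*"))  # raw pattern matches no single term
print(search_wildcard(index, "file.jp*"))               # pattern aligned with a stored term
```

In this toy model, a wildcard pattern only succeeds when it lines up with one stored term, such as file.jp*. Whether the same pattern works against a real index depends on the tokens Solr actually produced, which can be checked on the Analysis screen of the Solr admin UI.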