Solr StandardTokenizer: how is an underscore processed with wildcards?

Tags: java, solr

I have a Solr instance which processes inputs and queries using StandardTokenizer (as well as ClassicFilterFactory, LowerCaseFilterFactory and StopFilterFactory).

In my index are a number of files with underscore-separated names (e.g. some_indexed_file.jpg).

I've noticed that if I query for some_indexed_file.jpg, I get the file I'm looking for returned correctly.

However, if I instead search for some_indexed_file.jp* (with an asterisk, which I presume acts as a wildcard), I get no results, even though by my understanding it should produce similar results.

Any idea what's going on? I assume I'm misunderstanding something about the way Solr processes queries.

Edit: as requested, here are the schema XML configuration entries:

    <fieldType name="default" class="solr.TextField">
        <analyzer type="index">
            <tokenizer class="solr.StandardTokenizerFactory" />
            <filter class="solr.ClassicFilterFactory" />
            <filter class="solr.LowerCaseFilterFactory" />
            <filter class="solr.StopFilterFactory" />
        </analyzer>
        <analyzer type="query">
            <tokenizer class="solr.StandardTokenizerFactory" />
            <filter class="solr.ClassicFilterFactory" />
            <filter class="solr.LowerCaseFilterFactory" />
            <filter class="solr.StopFilterFactory" />
        </analyzer>
    </fieldType>



    <field name="filename" type="default" multiValued="true" omitNorms="false" termVectors="false"/>
RoryB asked Nov 04 '22 11:11



1 Answer

Well, a bit more research has solved the problem: the underlying issue is that Solr doesn't apply text analysis to wildcard queries.

This meant that it was looking for a single indexed term matching the literal pattern some_indexed_file.jp*. However, when the filename was indexed, it was tokenised into "some", "indexed" and "file.jpg", none of which matches that pattern.
The query for some_indexed_file.jpg, by contrast, went through the query analyzer and was tokenised the same way, so it returned the right results.
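
To make the mismatch concrete, here is a minimal Python sketch that only simulates the behaviour described above; it does not call Solr or Lucene. The term list is hard-coded from the tokenisation reported in this answer, and `wildcard_matches` is a hypothetical helper that uses shell-style globbing as a stand-in for Solr's wildcard matching.

```python
from fnmatch import fnmatchcase

# Terms this answer reports for "some_indexed_file.jpg" at index time.
# Hard-coded assumption: not produced by a real Solr analyzer.
INDEXED_TERMS = ["some", "indexed", "file.jpg"]


def wildcard_matches(pattern, terms):
    """Return the indexed terms matched by an unanalysed wildcard pattern."""
    return [term for term in terms if fnmatchcase(term, pattern)]


# The wildcard query skips analysis, so the whole string is one pattern
# and is compared against each indexed term individually.
print(wildcard_matches("some_indexed_file.jp*", INDEXED_TERMS))  # []

# A pattern that lines up with a single indexed term does match.
print(wildcard_matches("file.jp*", INDEXED_TERMS))  # ['file.jpg']
```

Under the same assumption about how the filename was tokenised, a wildcard only has a chance of matching when the pattern corresponds to one indexed term, not to the original, untokenised filename.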

RoryB answered Nov 12 '22 19:11
