
SOLR 4.0 alphabetical sorting trouble

Tags:

solr

I'm having a hard time getting my head around an issue with my SOLR address database.

I built this one up from the example files. I'm basically running the example configuration with a modified schema.

schema.xml:

<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
<field name="_version_" type="long" indexed="true" stored="true" required="false" multiValued="false" />

<field name="givenname_s" type="text_de" indexed="true" stored="true" required="true" multiValued="false" />
<field name="middleinitial_s" type="text_de" indexed="false" stored="true" required="false" multiValued="false" />
<field name="surname_s" type="text_de" indexed="true" stored="true" required="true" multiValued="false" />
<field name="gender_s" type="string" indexed="true" stored="true" required="true" multiValued="false" />
<field name="pictureuri_s" type="string" indexed="false" stored="true" required="false" multiValued="false" />
<field name="function_s" type="text_de" indexed="true" stored="true" required="false" multiValued="false" />
<field name="organizationalunit_s" type="text_general" indexed="true" stored="true" required="false" multiValued="false" />
<field name="organizationalunitdescription_s" type="text_de" indexed="false" stored="true" required="false" multiValued="false" />
<field name="company_s" type="text_de" indexed="true" stored="true" required="false" multiValued="false" />
<field name="street_s" type="text_de" indexed="true" stored="true" required="false" multiValued="false" />
<field name="streetnumber_s" type="int" indexed="true" stored="true" required="false" multiValued="false" />
<field name="postcode_s" type="int" indexed="true" stored="true" required="false" multiValued="false" />
<field name="city_s" type="text_de" indexed="true" stored="true" required="false" multiValued="false" />
<field name="building_s" type="text_de" indexed="true" stored="true" required="false" multiValued="false" />
<field name="roomnumber_s" type="int" indexed="true" stored="true" required="false" multiValued="false" />
<field name="country_s" type="text_en" indexed="true" stored="true" required="true" multiValued="false" />
<field name="countrycode_s" type="string" indexed="true" stored="true" required="true" multiValued="false" />
<field name="emailaddress_s" type="string" indexed="true" stored="true" required="false" multiValued="false" />
<field name="phone1_s" type="string" indexed="true" stored="true" required="false" multiValued="false" />
<field name="phone2_s" type="string" indexed="true" stored="true" required="false" multiValued="false" />
<field name="mobile_s" type="string" indexed="true" stored="true" required="false" multiValued="false" />
<field name="fax_s" type="string" indexed="true" stored="true" required="false" multiValued="false" />

I am populating the database by pushing about 20,000 random test records like the following to post.jar:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<add>
    <doc>
        <field name="id">1352498443_1</field>
        <field name="givenname_s">Aynur</field>
        <field name="middleinitial_s"/>
        <field name="surname_s">Lehnen</field>
        <field name="gender_s">F</field>
        <field name="pictureuri_s">dummy_assets/female.jpg</field>
        <field name="function_s">Zugschaffner/in</field>
        <field name="organizationalunit_s">P 07</field>
        <field name="organizationalunitdescription_s">Lorem Ipsum sadipscing voluptua ipsum invidunt dolor et dolore invidunt sed consetetur accusam dolore Lorem tempor.</field>
        <field name="company_s">Lorem Lagna Epsum Emet</field>
        <field name="street_s">Erlenweg</field>
        <field name="streetnumber_s">82</field>
        <field name="postcode_s">76297</field>
        <field name="city_s">Lübeck</field>
        <field name="building_s"/>
        <field name="roomnumber_s">242</field>
        <field name="country_s">GERMANY</field>
        <field name="countrycode_s">DE</field>
        <field name="emailaddress_s">[email protected]</field>
        <field name="phone1_s">0392984823</field>
        <field name="phone2_s">0124111417</field>
        <field name="mobile_s">0325117132</field>
        <field name="fax_s">0171459177</field>
    </doc>
</add>

However, when retrieving data I seem to have problems with alphabetical sorting. Consider the following query and its response:

{
    "responseHeader": {
        "status": 0,
        "QTime": 5,
        "params": {
            "sort": "surname_s asc",
            "fl": "surname_s",
            "indent": "true",
            "wt": "json",
            "q": "city_s:berlin"
        }
    },
    "response": {
        "numFound": 1094,
        "start": 0,
        "docs": [{
            "surname_s": "Weil"
        }, {
            "surname_s": "Abel"
        }, {
            "surname_s": "Adam"
        }, {
            "surname_s": "Ade"
        }, {
            "surname_s": "Adrian"
        }, {
            "surname_s": "Aigner"
        }, {
            "surname_s": "Aigner"
        }, {
            "surname_s": "Alber"
        }, {
            "surname_s": "Alber"
        }, {
            "surname_s": "Albers"
        }]
    }
}

Why is "Weil" in first position, while the rest of the data appears to be sorted correctly?

asked Nov 13 '12 at 12:11 by mritz_p


1 Answer

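I believe that some of the analyzers applied by the text_de field type are the cause of this sorting behavior. text_de tokenizes the value and runs it through stop-word and stemming filters, so the indexed terms no longer match the original string. One plausible explanation for "Weil" in particular is that weil is a German stop word: if the stop filter removes it, the document has no indexed term left to sort on and may sort as if the value were missing. In my experience, the best results when sorting strings come from the alphaOnlySort fieldType that ships with the example schema.xml, shown below.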

<fieldType name="alphaOnlySort" class="solr.TextField" sortMissingLast="true" omitNorms="true">
  <analyzer>
    <!-- KeywordTokenizer does no actual tokenizing, so the entire
         input string is preserved as a single token
      -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <!-- The LowerCase TokenFilter does what you expect, which can be
         useful when you want your sorting to be case insensitive
      -->
    <filter class="solr.LowerCaseFilterFactory" />
    <!-- The TrimFilter removes any leading or trailing whitespace -->
    <filter class="solr.TrimFilterFactory" />
    <!-- The PatternReplaceFilter gives you the flexibility to use
         Java Regular expression to replace any sequence of characters
         matching a pattern with an arbitrary replacement string, 
         which may include back references to portions of the original
         string matched by the pattern.

         See the Java Regular Expression documentation for more
         information on pattern and replacement string syntax.

         http://java.sun.com/j2se/1.6.0/docs/api/java/util/regex/package-summary.html
      -->
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="([^a-z])" replacement="" replace="all"
    />
  </analyzer>
</fieldType>
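
To make the effect of this analyzer chain concrete, here is a rough Python approximation of the sort key it would produce for a value. This is only a sketch and not Solr's actual code: the function name alpha_only_sort_key and the sample surnames are made up for illustration, and Python's lower() may differ from Lucene's lowercasing for some characters.

```python
import re


def alpha_only_sort_key(value: str) -> str:
    """Approximate the alphaOnlySort analyzer chain for one field value."""
    # KeywordTokenizer: the whole input stays a single token (no splitting).
    token = value
    # LowerCaseFilter: make the key case insensitive.
    token = token.lower()
    # TrimFilter: drop leading and trailing whitespace.
    token = token.strip()
    # PatternReplaceFilter with pattern [^a-z]: delete every other character.
    return re.sub(r"[^a-z]", "", token)


# Hypothetical sample values, not taken from the real index.
surnames = ["Weil", "Abel", "adam", " Ade "]
ordered = sorted(surnames, key=alpha_only_sort_key)
print(ordered)

# Non-ASCII letters are stripped as well, which can matter for German names.
print(alpha_only_sort_key("Müller"))
```

Note that the pattern [^a-z] also removes umlauts and ß, so "Müller" would be keyed as "mller". If that ordering is not acceptable for German surnames, the pattern would have to be adjusted.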

I would recommend creating a new field and then copying the value from surname_s via copyField, something like the following:

 <field name="surname_s_sort" type="alphaOnlySort" indexed="true" stored="false" required="false" multiValued="false" />

 <copyField source="surname_s" dest="surname_s_sort"/>

Note: there is no need to store the value in the surname_s_sort field (hence the stored="false" attribute) unless you expect to display it to users.

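Then you can change your query to sort on surname_s_sort instead, for example sort=surname_s_sort asc. Because copyField is applied at index time, existing documents need to be reindexed before the new field contains any values.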
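
As a sketch of what the changed request could look like, the following Python snippet builds the query URL from the question with the sort parameter switched to surname_s_sort. The host, port and core name (localhost:8983, collection1) are assumptions based on the stock example setup, and build_sorted_query_url is a hypothetical helper. The snippet only builds the URL and does not contact Solr.

```python
from urllib.parse import urlencode

# Assumption: the Solr example server on its default port with the
# default core name; adjust this to your own installation.
SOLR_SELECT_URL = "http://localhost:8983/solr/collection1/select"


def build_sorted_query_url(base_url: str = SOLR_SELECT_URL) -> str:
    """Build the select URL, sorting on the copyField target."""
    params = {
        "q": "city_s:berlin",
        "sort": "surname_s_sort asc",  # sort on the new field
        "fl": "surname_s",             # still return the original field
        "wt": "json",
        "indent": "true",
    }
    # urlencode handles escaping of the colon and the space in the values.
    return base_url + "?" + urlencode(params)


url = build_sorted_query_url()
print(url)
```

The fl parameter still asks for surname_s, since surname_s_sort is not stored and is only used for ordering.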

answered Oct 30 '22 at 16:10 by Paige Cook