 

How to sort text/string in solr using a natural sort order?

I would like to sort a list of values somewhat along the lines of:

  • 4
  • 5xa
  • 8kdjfew454
  • 9
  • 10
  • 999cc
  • b
  • c9
  • c10cc
  • c11

In other words, this is what is sometimes called "natural sorting": text is sorted alphabetically/lexicographically, but runs of digits are sorted numerically, even when both are mixed in the same string.

I can't find any way to do this in Solr (4.0 at the moment). Is there a standard way to do this, or at least a workable "recipe"?

Gus asked Nov 13 '22 10:11



1 Answer

The closest thing you can achieve is described in this article.

From the article:

To force numbers to sort numerically, we need to left-pad any numbers with zeroes: 2 becomes 0002, 10 becomes 0010, 100 becomes 0100, et cetera. Then even a lexical sort will arrange values like this:

Title No. 1
Title No. 2
Title No. 10
Title No. 100
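The effect of zero-padding can be illustrated outside Solr. The following is a minimal Python sketch, not part of the article; the helper name `pad_numbers` and the width of 4 (taken from the 0002 / 0010 / 0100 example above) are assumptions chosen for illustration.

```python
import re

def pad_numbers(text, width=4):
    """Left-pad every run of digits in text with zeroes up to width characters."""
    # The helper name and default width are illustrative, not taken from Solr.
    return re.sub(r"[0-9]+", lambda m: m.group().zfill(width), text)

titles = ["Title No. 10", "Title No. 2", "Title No. 100", "Title No. 1"]

# A plain lexical sort compares character by character, so "10" lands before "2".
lexical = sorted(titles)

# Sorting on the padded form is still lexical, but now agrees with numeric order.
padded = sorted(titles, key=pad_numbers)

print(lexical)  # ['Title No. 1', 'Title No. 10', 'Title No. 100', 'Title No. 2']
print(padded)   # ['Title No. 1', 'Title No. 2', 'Title No. 10', 'Title No. 100']
```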

The Field Type

This alphanumeric sort field type converts any numbers found to 6 digits, padded with zeroes. (If you expect numbers larger than 6 digits in your field values, you will need to increase the number of zeroes when padding.)

The field type also removes English and French leading articles, lowercases, and purges any character that isn’t alphanumeric. It is English-centric, and assumes that diacritics have been folded into ASCII characters.

<fieldType name="alphaNumericSort" class="solr.TextField" sortMissingLast="false" omitNorms="true">
  <analyzer>
    <!-- KeywordTokenizer does no actual tokenizing, so the entire
         input string is preserved as a single token
      -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <!-- The LowerCase TokenFilter does what you expect, which can be
         useful when you want your sorting to be case insensitive
      -->
    <filter class="solr.LowerCaseFilterFactory" />
    <!-- The TrimFilter removes any leading or trailing whitespace -->
    <filter class="solr.TrimFilterFactory" />
    <!-- Remove leading articles -->
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="^(a |the |les |la |le |l'|de la |du |des )" replacement="" replace="all"
    />
    <!-- Left-pad numbers with zeroes -->
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="(\d+)" replacement="00000$1" replace="all"
    />
    <!-- Left-trim zeroes to produce 6 digit numbers -->
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="0*([0-9]{6,})" replacement="$1" replace="all"
    />
    <!-- Remove all but alphanumeric characters -->
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="([^a-z0-9])" replacement="" replace="all"
    />
  </analyzer>
</fieldType>

Sample output

Title No. 1 => titleno000001
Title No. 2 => titleno000002
Title No. 10 => titleno000010
Title No. 100 => titleno000100
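To check what the analyzer chain would do to a given value without reindexing, the filters can be approximated in a few lines of Python. This is only a sketch under stated assumptions: the function name `alpha_numeric_sort_key` is hypothetical, Python's `re` module stands in for the Java regex engine used by Solr (the two behave the same for these simple patterns, with `[0-9]` used in place of `\d` to keep matching ASCII-only), and `str.lower()` / `str.strip()` stand in for the lowercase and trim filters.

```python
import re

# Each (pattern, replacement) pair mirrors one PatternReplaceFilterFactory
# entry of the field type, applied in the same order.
_STEPS = [
    # Remove leading articles
    (re.compile(r"^(a |the |les |la |le |l'|de la |du |des )"), ""),
    # Left-pad every number with five zeroes
    (re.compile(r"([0-9]+)"), r"00000\g<1>"),
    # Left-trim zeroes so that short numbers end up exactly 6 digits long
    (re.compile(r"0*([0-9]{6,})"), r"\g<1>"),
    # Remove all but alphanumeric characters
    (re.compile(r"[^a-z0-9]"), ""),
]

def alpha_numeric_sort_key(value):
    """Approximate the sort key the alphaNumericSort field type would index."""
    key = value.lower().strip()  # LowerCaseFilter, then TrimFilter
    for pattern, replacement in _STEPS:
        key = pattern.sub(replacement, key)
    return key

# The values from the question, deliberately out of order.
values = ["c11", "10", "b", "999cc", "4", "c10cc", "9", "8kdjfew454", "5xa", "c9"]
print(sorted(values, key=alpha_numeric_sort_key))
# ['4', '5xa', '8kdjfew454', '9', '10', '999cc', 'b', 'c9', 'c10cc', 'c11']
```

Under these assumptions the resulting order matches the one asked for in the question. Note that numbers longer than 6 digits are left at their full length by the trim step, so supporting them would mean changing both the number of padding zeroes and the `{6,}` quantifier.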

Sharun answered Jan 04 '23 02:01
