I would like to sort a list of values somewhat along the lines of:
- 4
- 5xa
- 8kdjfew454
- 9
- 10
- 999cc
- b
- c9
- c10cc
- c11
In other words, what is sometimes referred to as "natural sorting", where text is sorted alphabetically/lexicographically where there is text, but numerically where there are numbers, even if both are mixed in the same string.
I can't find any way to do this in Solr (4.0 at the moment). Is there a standard way to do this, or at least a workable "recipe"?
The closest thing you can achieve is described in this article
From the article:
To force numbers to sort numerically, we need to left-pad any numbers with zeroes: 2 becomes 0002, 10 becomes 0010, 100 becomes 0100, et cetera. Then even a lexical sort will arrange values like this:
Title No. 1
Title No. 2
Title No. 10
Title No. 100
The Field Type
This alphanumeric sort field type converts any numbers found to 6 digits, padded with zeroes. (If you expect numbers larger than 6 digits in your field values, you will need to increase the number of zeroes when padding.)
The field type also removes English and French leading articles, lowercases, and purges any character that isn’t alphanumeric. It is English-centric, and assumes that diacritics have been folded into ASCII characters.
<fieldType name="alphaNumericSort" class="solr.TextField" sortMissingLast="false" omitNorms="true">
  <analyzer>
    <!-- KeywordTokenizer does no actual tokenizing, so the entire
         input string is preserved as a single token
    -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <!-- The LowerCase TokenFilter does what you expect, which can be
         useful when you want your sorting to be case insensitive
    -->
    <filter class="solr.LowerCaseFilterFactory" />
    <!-- The TrimFilter removes any leading or trailing whitespace -->
    <filter class="solr.TrimFilterFactory" />
    <!-- Remove leading articles -->
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="^(a |the |les |la |le |l'|de la |du |des )" replacement="" replace="all"
    />
    <!-- Left-pad numbers with zeroes -->
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="(\d+)" replacement="00000$1" replace="all"
    />
    <!-- Left-trim zeroes to produce 6 digit numbers -->
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="0*([0-9]{6,})" replacement="$1" replace="all"
    />
    <!-- Remove all but alphanumeric characters -->
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="([^a-z0-9])" replacement="" replace="all"
    />
  </analyzer>
</fieldType>
Sample output
Title No. 1 => titleno000001
Title No. 2 => titleno000002
Title No. 10 => titleno000010
Title No. 100 => titleno000100
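The article excerpt stops at the field type, so here is a minimal sketch of how it might be wired into schema.xml. The field names `title` and `title_sort`, and the `string` type of the source field, are assumptions for illustration only; substitute whatever your schema actually uses.

```xml
<!-- Hypothetical field names: "title" holds the displayed value,
     "title_sort" holds the normalized copy used only for sorting -->
<field name="title" type="string" indexed="true" stored="true"/>
<field name="title_sort" type="alphaNumericSort" indexed="true" stored="false"/>

<!-- Run the original value through the alphaNumericSort analyzer at index time -->
<copyField source="title" dest="title_sort"/>
```

With something like this in place, a query would sort with `sort=title_sort asc` while still displaying `title`. Applied by hand to the values in the question, the filter chain should produce 000004, 000005xa, 000008kdjfew000454, 000009, 000010, 000999cc, b, c000009, c000010cc and c000011, which sort lexically in the requested order. Existing documents need to be reindexed before the new field is populated.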