Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Natural (human alpha-numeric) sort in Microsoft SQL 2005

People also ask

How do I select numeric values from an alphanumeric column in SQL?

You can take use advantage of ISNUMERIC() function in SQL Server .

How do I sort numbers in SQL?

The ORDER BY statement in SQL is used to sort the fetched data in either ascending or descending according to one or more columns. By default ORDER BY sorts the data in ascending order. We can use the keyword DESC to sort the data in descending order and the keyword ASC to sort in ascending order.


order by LEN(value), value

Not perfect, but works well in a lot of cases.


Most of the SQL-based solutions I have seen break when the data gets complex enough (e.g. more than one or two numbers in it). Initially I tried implementing a NaturalSort function in T-SQL that met my requirements (among other things, handles an arbitrary number of numbers within the string), but the performance was way too slow.

Ultimately, I wrote a scalar CLR function in C# to allow for a natural sort, and even with unoptimized code the performance calling it from SQL Server is blindingly fast. It has the following characteristics:

  • will sort the first 1,000 characters or so correctly (easily modified in code or made into a parameter)
  • properly sorts decimals, so 123.333 comes before 123.45
  • because of above, will likely NOT sort things like IP addresses correctly; if you wish different behaviour, modify the code
  • supports sorting a string with an arbitrary number of numbers within it
  • will correctly sort numbers up to 25 digits long (easily modified in code or made into a parameter)

The code is here:

using System;
using System.Data.SqlTypes;
using System.Text;
using Microsoft.SqlServer.Server;

public class UDF
{
    [SqlFunction(DataAccess = DataAccessKind.None, IsDeterministic=true)]
    public static SqlString Naturalize(string val)
    {
        if (String.IsNullOrEmpty(val))
            return val;

        while(val.Contains("  "))
            val = val.Replace("  ", " ");

        const int maxLength = 1000;
        const int padLength = 25;

        bool inNumber = false;
        bool isDecimal = false;
        int numStart = 0;
        int numLength = 0;
        int length = val.Length < maxLength ? val.Length : maxLength;

        //TODO: optimize this so that we exit for loop once sb.ToString() >= maxLength
        var sb = new StringBuilder();
        for (var i = 0; i < length; i++)
        {
            int charCode = (int)val[i];
            if (charCode >= 48 && charCode <= 57)
            {
                if (!inNumber)
                {
                    numStart = i;
                    numLength = 1;
                    inNumber = true;
                    continue;
                }
                numLength++;
                continue;
            }
            if (inNumber)
            {
                sb.Append(PadNumber(val.Substring(numStart, numLength), isDecimal, padLength));
                inNumber = false;
            }
            isDecimal = (charCode == 46);
            sb.Append(val[i]);
        }
        if (inNumber)
            sb.Append(PadNumber(val.Substring(numStart, numLength), isDecimal, padLength));

        var ret = sb.ToString();
        if (ret.Length > maxLength)
            return ret.Substring(0, maxLength);

        return ret;
    }

    static string PadNumber(string num, bool isDecimal, int padLength)
    {
        return isDecimal ? num.PadRight(padLength, '0') : num.PadLeft(padLength, '0');
    }
}

To register this so that you can call it from SQL Server, run the following commands in Query Analyzer:

CREATE ASSEMBLY SqlServerClr FROM 'SqlServerClr.dll' --put the full path to DLL here
go
CREATE FUNCTION Naturalize(@val as nvarchar(max)) RETURNS nvarchar(1000) 
EXTERNAL NAME SqlServerClr.UDF.Naturalize
go

Then, you can use it like so:

select *
from MyTable
order by dbo.Naturalize(MyTextField)

Note: If you get an error in SQL Server along the lines of Execution of user code in the .NET Framework is disabled. Enable "clr enabled" configuration option., follow the instructions here to enable it. Make sure you consider the security implications before doing so. If you are not the db admin, make sure you discuss this with your admin before making any changes to the server configuration.

Note2: This code does not properly support internationalization (e.g., assumes the decimal marker is ".", is not optimized for speed, etc. Suggestions on improving it are welcome!

Edit: Renamed the function to Naturalize instead of NaturalSort, since it does not do any actual sorting.


I know this is an old question but I just came across it and since it's not got an accepted answer.

I have always used ways similar to this:

SELECT [Column] FROM [Table]
ORDER BY RIGHT(REPLICATE('0', 1000) + LTRIM(RTRIM(CAST([Column] AS VARCHAR(MAX)))), 1000)

The only common times that this has issues is if your column won't cast to a VARCHAR(MAX), or if LEN([Column]) > 1000 (but you can change that 1000 to something else if you want), but you can use this rough idea for what you need.

Also this is much worse performance than normal ORDER BY [Column], but it does give you the result asked for in the OP.

Edit: Just to further clarify, this the above will not work if you have decimal values such as having 1, 1.15 and 1.5, (they will sort as {1, 1.5, 1.15}) as that is not what is asked for in the OP, but that can easily be done by:

SELECT [Column] FROM [Table]
ORDER BY REPLACE(RIGHT(REPLICATE('0', 1000) + LTRIM(RTRIM(CAST([Column] AS VARCHAR(MAX)))) + REPLICATE('0', 100 - CHARINDEX('.', REVERSE(LTRIM(RTRIM(CAST([Column] AS VARCHAR(MAX))))), 1)), 1000), '.', '0')

Result: {1, 1.15, 1.5}

And still all entirely within SQL. This will not sort IP addresses because you're now getting into very specific number combinations as opposed to simple text + number.


Here's a solution written for SQL 2000. It can probably be improved for newer SQL versions.

/**
 * Returns a string formatted for natural sorting. This function is very useful when having to sort alpha-numeric strings.
 *
 * @author Alexandre Potvin Latreille (plalx)
 * @param {nvarchar(4000)} string The formatted string.
 * @param {int} numberLength The length each number should have (including padding). This should be the length of the longest number. Defaults to 10.
 * @param {char(50)} sameOrderChars A list of characters that should have the same order. Ex: '.-/'. Defaults to empty string.
 *
 * @return {nvarchar(4000)} A string for natural sorting.
 * Example of use: 
 * 
 *      SELECT Name FROM TableA ORDER BY Name
 *  TableA (unordered)              TableA (ordered)
 *  ------------                    ------------
 *  ID  Name                        ID  Name
 *  1.  A1.                         1.  A1-1.       
 *  2.  A1-1.                       2.  A1.
 *  3.  R1             -->          3.  R1
 *  4.  R11                         4.  R11
 *  5.  R2                          5.  R2
 *
 *  
 *  As we can see, humans would expect A1., A1-1., R1, R2, R11 but that's not how SQL is sorting it.
 *  We can use this function to fix this.
 *
 *      SELECT Name FROM TableA ORDER BY dbo.udf_NaturalSortFormat(Name, default, '.-')
 *  TableA (unordered)              TableA (ordered)
 *  ------------                    ------------
 *  ID  Name                        ID  Name
 *  1.  A1.                         1.  A1.     
 *  2.  A1-1.                       2.  A1-1.
 *  3.  R1              -->         3.  R1
 *  4.  R11                         4.  R2
 *  5.  R2                          5.  R11
 */
ALTER FUNCTION [dbo].[udf_NaturalSortFormat](
    @string nvarchar(4000),
    @numberLength int = 10,
    @sameOrderChars char(50) = ''
)
RETURNS varchar(4000)
AS
BEGIN
    DECLARE @sortString varchar(4000),
        @numStartIndex int,
        @numEndIndex int,
        @padLength int,
        @totalPadLength int,
        @i int,
        @sameOrderCharsLen int;

    SELECT 
        @totalPadLength = 0,
        @string = RTRIM(LTRIM(@string)),
        @sortString = @string,
        @numStartIndex = PATINDEX('%[0-9]%', @string),
        @numEndIndex = 0,
        @i = 1,
        @sameOrderCharsLen = LEN(@sameOrderChars);

    -- Replace all char that have the same order by a space.
    WHILE (@i <= @sameOrderCharsLen)
    BEGIN
        SET @sortString = REPLACE(@sortString, SUBSTRING(@sameOrderChars, @i, 1), ' ');
        SET @i = @i + 1;
    END

    -- Pad numbers with zeros.
    WHILE (@numStartIndex <> 0)
    BEGIN
        SET @numStartIndex = @numStartIndex + @numEndIndex;
        SET @numEndIndex = @numStartIndex;

        WHILE(PATINDEX('[0-9]', SUBSTRING(@string, @numEndIndex, 1)) = 1)
        BEGIN
            SET @numEndIndex = @numEndIndex + 1;
        END

        SET @numEndIndex = @numEndIndex - 1;

        SET @padLength = @numberLength - (@numEndIndex + 1 - @numStartIndex);

        IF @padLength < 0
        BEGIN
            SET @padLength = 0;
        END

        SET @sortString = STUFF(
            @sortString,
            @numStartIndex + @totalPadLength,
            0,
            REPLICATE('0', @padLength)
        );

        SET @totalPadLength = @totalPadLength + @padLength;
        SET @numStartIndex = PATINDEX('%[0-9]%', RIGHT(@string, LEN(@string) - @numEndIndex));
    END

    RETURN @sortString;
END

I know this is a bit old at this point, but in my search for a better solution, I came across this question. I'm currently using a function to order by. It works fine for my purpose of sorting records which are named with mixed alpha numeric ('item 1', 'item 10', 'item 2', etc)

CREATE FUNCTION [dbo].[fnMixSort]
(
    @ColValue NVARCHAR(255)
)
RETURNS NVARCHAR(1000)
AS

BEGIN
    DECLARE @p1 NVARCHAR(255),
        @p2 NVARCHAR(255),
        @p3 NVARCHAR(255),
        @p4 NVARCHAR(255),
        @Index TINYINT

    IF @ColValue LIKE '[a-z]%'
        SELECT  @Index = PATINDEX('%[0-9]%', @ColValue),
            @p1 = LEFT(CASE WHEN @Index = 0 THEN @ColValue ELSE LEFT(@ColValue, @Index - 1) END + REPLICATE(' ', 255), 255),
            @ColValue = CASE WHEN @Index = 0 THEN '' ELSE SUBSTRING(@ColValue, @Index, 255) END
    ELSE
        SELECT  @p1 = REPLICATE(' ', 255)

    SELECT  @Index = PATINDEX('%[^0-9]%', @ColValue)

    IF @Index = 0
        SELECT  @p2 = RIGHT(REPLICATE(' ', 255) + @ColValue, 255),
            @ColValue = ''
    ELSE
        SELECT  @p2 = RIGHT(REPLICATE(' ', 255) + LEFT(@ColValue, @Index - 1), 255),
            @ColValue = SUBSTRING(@ColValue, @Index, 255)

    SELECT  @Index = PATINDEX('%[0-9,a-z]%', @ColValue)

    IF @Index = 0
        SELECT  @p3 = REPLICATE(' ', 255)
    ELSE
        SELECT  @p3 = LEFT(REPLICATE(' ', 255) + LEFT(@ColValue, @Index - 1), 255),
            @ColValue = SUBSTRING(@ColValue, @Index, 255)

    IF PATINDEX('%[^0-9]%', @ColValue) = 0
        SELECT  @p4 = RIGHT(REPLICATE(' ', 255) + @ColValue, 255)
    ELSE
        SELECT  @p4 = LEFT(@ColValue + REPLICATE(' ', 255), 255)

    RETURN  @p1 + @p2 + @p3 + @p4

END

Then call

select item_name from my_table order by fnMixSort(item_name)

It easily triples the processing time for a simple data read, so it may not be the perfect solution.


RedFilter's answer is great for reasonably sized datasets where indexing is not critical, however if you want an index, several tweaks are required.

First, mark the function as not doing any data access and being deterministic and precise:

[SqlFunction(DataAccess = DataAccessKind.None,
                          SystemDataAccess = SystemDataAccessKind.None,
                          IsDeterministic = true, IsPrecise = true)]

Next, MSSQL has a 900 byte limit on the index key size, so if the naturalized value is the only value in the index, it must be at most 450 characters long. If the index includes multiple columns, the return value must be even smaller. Two changes:

CREATE FUNCTION Naturalize(@str AS nvarchar(max)) RETURNS nvarchar(450)
    EXTERNAL NAME ClrExtensions.Util.Naturalize

and in the C# code:

const int maxLength = 450;

Finally, you will need to add a computed column to your table, and it must be persisted (because MSSQL cannot prove that Naturalize is deterministic and precise), which means the naturalized value is actually stored in the table but is still maintained automatically:

ALTER TABLE YourTable ADD nameNaturalized AS dbo.Naturalize(name) PERSISTED

You can now create the index!

CREATE INDEX idx_YourTable_n ON YourTable (nameNaturalized)

I've also made a couple of changes to RedFilter's code: using chars for clarity, incorporating duplicate space removal into the main loop, exiting once the result is longer than the limit, setting maximum length without substring etc. Here's the result:

using System.Data.SqlTypes;
using System.Text;
using Microsoft.SqlServer.Server;

public static class Util
{
    [SqlFunction(DataAccess = DataAccessKind.None, SystemDataAccess = SystemDataAccessKind.None, IsDeterministic = true, IsPrecise = true)]
    public static SqlString Naturalize(string str)
    {
        if (string.IsNullOrEmpty(str))
            return str;

        const int maxLength = 450;
        const int padLength = 15;

        bool isDecimal = false;
        bool wasSpace = false;
        int numStart = 0;
        int numLength = 0;

        var sb = new StringBuilder();
        for (var i = 0; i < str.Length; i++)
        {
            char c = str[i];
            if (c >= '0' && c <= '9')
            {
                if (numLength == 0)
                    numStart = i;
                numLength++;
            }
            else
            {
                if (numLength > 0)
                {
                    sb.Append(pad(str.Substring(numStart, numLength), isDecimal, padLength));
                    numLength = 0;
                }
                if (c != ' ' || !wasSpace)
                    sb.Append(c);
                isDecimal = c == '.';
                if (sb.Length > maxLength)
                    break;
            }
            wasSpace = c == ' ';
        }
        if (numLength > 0)
            sb.Append(pad(str.Substring(numStart, numLength), isDecimal, padLength));

        if (sb.Length > maxLength)
            sb.Length = maxLength;
        return sb.ToString();
    }

    private static string pad(string num, bool isDecimal, int padLength)
    {
        return isDecimal ? num.PadRight(padLength, '0') : num.PadLeft(padLength, '0');
    }
}

Here is an other solution that I like: http://www.dreamchain.com/sql-and-alpha-numeric-sort-order/

It's not Microsoft SQL, but since I ended up here when I was searching for a solution for Postgres, I thought adding this here would help others.

EDIT: Here is the code, in case the link goes away.

CREATE or REPLACE FUNCTION pad_numbers(text) RETURNS text AS $$
  SELECT regexp_replace(regexp_replace(regexp_replace(regexp_replace(($1 collate "C"),
    E'(^|\\D)(\\d{1,3}($|\\D))', E'\\1000\\2', 'g'),
      E'(^|\\D)(\\d{4,6}($|\\D))', E'\\1000\\2', 'g'),
        E'(^|\\D)(\\d{7}($|\\D))', E'\\100\\2', 'g'),
          E'(^|\\D)(\\d{8}($|\\D))', E'\\10\\2', 'g');
$$ LANGUAGE SQL;

"C" is the default collation in postgresql; you may specify any collation you desire, or remove the collation statement if you can be certain your table columns will never have a nondeterministic collation assigned.

usage:

SELECT * FROM wtf w 
  WHERE TRUE
  ORDER BY pad_numbers(w.my_alphanumeric_field)