Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

String Comparison differences between .NET and T-SQL?

In a test case I've written, the string comparison doesn't appear to work the same way between SQL server / .NET CLR.

This C# code:

string lesser =  "SR2-A1-10-90";
string greater = "SR2-A1-100-10";

Debug.WriteLine(string.Compare("A","B"));
Debug.WriteLine(string.Compare(lesser, greater));

Will output:

-1
1

This SQL Server code:

declare @lesser varchar(20);
declare @greater varchar(20);

set @lesser =  'SR2-A1-10-90';
set @greater = 'SR2-A1-100-10';

IF @lesser < @greater
    SELECT 'Less Than';
ELSE
    SELECT 'Greater than';

Will output:

Less Than

Why the difference?

like image 275
Matt Brunell Avatar asked Sep 27 '10 21:09

Matt Brunell


2 Answers

This is documented here.

Windows collations (e.g. Latin1_General_CI_AS) use Unicode type collation rules. SQL Collations don't.

This causes the hyphen character to be treated differently between the two.

like image 127
Martin Smith Avatar answered Oct 05 '22 22:10

Martin Smith


Further to gbn's answer, you can make them behave the same by using CompareOptions.StringSort in C# (or by using StringComparison.Ordinal). This treats symbols as occurring before alphanumeric symbols, so "-" < "0".

However, Unicode vs ASCII doesn't explain anything, as the hex codes for the ASCII codepage are translated verbatim to the Unicode codepage: "-" is 002D (45) while "0" is 0030 (48).

What is happening is that .NET is using "linguistic" sorting by default, which is based on a non-ordinal ordering and weight applied to various symbols by the specified or current culture. This linguistic algorithm allows, for instance, "résumé" (spelled with accents) to appear immediately following "resume" (spelled without accents) in a sorted list of words, as "é" is given a fractional order just after "e" and well before "f". It also allows "cooperation" and "co-operation" to be placed closely together, as the dash symbol is given low "weight"; it matters only as the absolute final tiebreakers when sorting words like "bits", "bit's", and "bit-shift" (which would appear in that order).

So-called ordinal sorting (strictly according to Unicode values, with or without case insensitivity) will produce very different and sometimes illogical results, as variants of letters usually appear well after the basic undecorated Latin alphabet in ASCII/Unicode ordinals, while symbols occur before it. For instance, "é" comes after "z" and so the words "resume", "rosin", "ruble", "résumé" would be sorted in that order. "Bit's", "Bit-shift", "Biter", "Bits" would be sorted in that order as the apostrophe comes first, followed by the dash, then the letter "e", then the letter "s". Neither of these seem logical from a "natural language" perspective.

like image 45
KeithS Avatar answered Oct 05 '22 20:10

KeithS