I ran into a case SQL server can store "sofia" and "sofia" as two different strings, but when compared in TSQL, they are the same whatever the COLLATE is used even if the binary Collate:
CREATE TABLE #R (NAME NvarchAR(255) COLLATE SQL_Latin1_General_CP1_CI_AS)
INSERT INTO #R VALUES (N'sofia')
INSERT INTO #r VALUES (N'sofia')
SELECT * FROM #r WHERE NAME = N'sofia'
sofia
sofia
(2 row(s) affected)
IF 'sofia' = 'sofia' COLLATE SQL_Latin1_General_CP1_CI_AS
SELECT 'Values are the same'
ELSE
SELECT 'Values are different'
-------------------
Values are the same
(1 row(s) affected)
IF 'sofia' = 'sofia' COLLATE SQL_Latin1_General_CP437_BIN
SELECT 'Values are the same'
ELSE
SELECT 'Values are different'
-------------------
Values are the same
(1 row(s) affected)
I tried to find out the encode of "sofia"
http://stackoverflow.com/questions/1025332/determine-a-strings-encoding-in-c-sharp
It said:
// If all else fails, the encoding is probably (though certainly not
// definitely) the user's local codepage! One might present to the user a
// list of alternative encodings as shown here: http://stackoverflow.com/questions/8509339/what-is-the-most-common-encoding-of-each-language
// A full list can be found using Encoding.GetEncodings();
I iterate through all the encoding returned from Encoding.GetEncodings(), none of them match
Looking into the binary I found an interesting fact: “sofia” itself is encoded with UTF16, but it can be generated from "SOFIA" UTF16 by filling “1” instead of “0” in the extra byte besides ASCII code (Ex for ‘S’: 83 255 vs 83 0) It is shown as lower case. In C#,
“sofia”
[0] 83 byte
[1] 255 byte
[2] 79 byte
[3] 255 byte
[4] 70 byte
[5] 255 byte
[6] 73 byte
[7] 255 byte
[8] 65 byte
[9] 255 byte
"SOFIA"
[0] 83 byte
[1] 0 byte
[2] 79 byte
[3] 0 byte
[4] 70 byte
[5] 0 byte
[6] 73 byte
[7] 0 byte
[8] 65 byte
[9] 0 byte
"sofia"
[0] 115 byte
[1] 0 byte
[2] 79 byte
[3] 0 byte
[4] 70 byte
[5] 0 byte
[6] 105 byte
[7] 0 byte
[8] 97 byte
[9] 0 byte
One can create two different directorie/files with name as C:\sofia\, C:\sofia\ or sofia.txt, sofia.txt.
Why does the SQL engine think they are the same while storing them with the original streams?
In order to get just the exact I want I had to convert to binary first:
SELECT * FROM #r WHERE CONVERT(VARBINARY(100), Name) = CONVERT(VARBINARY(100), N'sofia')
sofia
(1 row(s) affected)
SELECT * FROM #r WHERE CONVERT(VARBINARY(100), Name) = CONVERT(VARBINARY(100), N'sofia')
sofia
(1 row(s) affected)
But this has many side effects, such as culture and case. How can I teach TSQL Engine to know they are different without much cost?
Is there an official name of this kind of string encoding ?
There are two issues going on here.
First: there is the collation issue. Collations define sorting and equality of characters. As @Kazetsukai suggested, the specific collation property to help here is width-sensitivity. However, you cannot simply add _WS
to any collation name and assume that it will be a valid collation. And in fact, SQL_Latin1_General_CP1_CI_AS_WS
is not a valid collation.
There is a finite set of collations that you can get via SELECT * FROM fn_helpcollations() WHERE [name] LIKE N'latin%[_]ws';
. The results of that query indicate that the collation you are probably wanting is Latin1_General_CI_AS_WS
. And any of the collations ending in _BIN2
would work (try to not use collations ending in _BIN
as those have been deprecated, just as collations starting with SQL_
have been).
But, for some reason, even using those does not seem to work:
IF 'sofia' = 'sofia' COLLATE Latin1_General_CI_AS_WS
SELECT 'Values are the same'
ELSE
SELECT 'Values are different'
IF 'sofia' = 'sofia' COLLATE Latin1_General_BIN2
SELECT 'Values are the same'
ELSE
SELECT 'Values are different'
The results for both are "Values are the same". And that brings us to:
Second: When working with NVARCHAR
1 data, you must prefix string literals with a capital N
, else it implicitly converts the characters to their respective VARCHAR
2 characters first (or characters convert to ?
if there is no mapping defined between the Unicode Code Point and a character that exists in the Code Page specified by the collation of the field or operation).
IF N'sofia' = N'sofia' COLLATE Latin1_General_CI_AS_WS
SELECT 'Values are the same'
ELSE
SELECT 'Values are different'
IF N'sofia' = N'sofia' COLLATE Latin1_General_BIN2
SELECT 'Values are the same'
ELSE
SELECT 'Values are different'
Prefixing those literal values with N
allows for the expected behavior and the results for both queries is now "Values are different".
1 The XML
and N
-prefixed types store data as UTF-16 Little Endian. The default handling is just the UCS-2 / Base Multilingual Plane (BMP) characters. But, if using a collation ending in _SC
, then it can properly handle the full UTF-16 with Supplementary Characters.
2 The CHAR
, VARCHAR
, and TEXT
(but don't use this last one as it is deprecated) types are 8-bit ASCII Extended with Code Pages.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With