Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why did TSQL treat "sofia" as being the same as "sofia"? What string encoding is this?

I ran into a case SQL server can store "sofia" and "sofia" as two different strings, but when compared in TSQL, they are the same whatever the COLLATE is used even if the binary Collate:

CREATE TABLE #R (NAME NvarchAR(255) COLLATE SQL_Latin1_General_CP1_CI_AS)
INSERT INTO #R VALUES (N'sofia')
INSERT INTO #r VALUES (N'sofia')

SELECT * FROM #r WHERE NAME = N'sofia'

sofia
sofia

(2 row(s) affected)

IF 'sofia' = 'sofia'  COLLATE SQL_Latin1_General_CP1_CI_AS 
SELECT 'Values are the same'
ELSE
SELECT 'Values are different'

-------------------
Values are the same

(1 row(s) affected)

IF 'sofia' = 'sofia'  COLLATE SQL_Latin1_General_CP437_BIN
SELECT 'Values are the same'
ELSE
SELECT 'Values are different'

-------------------
Values are the same

(1 row(s) affected)

I tried to find out the encode of "sofia"

http://stackoverflow.com/questions/1025332/determine-a-strings-encoding-in-c-sharp

It said:

            // If all else fails, the encoding is probably (though certainly not
            // definitely) the user's local codepage! One might present to the user a
            // list of alternative encodings as shown here: http://stackoverflow.com/questions/8509339/what-is-the-most-common-encoding-of-each-language
            // A full list can be found using Encoding.GetEncodings();

I iterate through all the encoding returned from Encoding.GetEncodings(), none of them match

Looking into the binary I found an interesting fact: “sofia” itself is encoded with UTF16, but it can be generated from  "SOFIA" UTF16 by filling “1” instead of “0” in the extra byte besides ASCII code (Ex for ‘S’: 83 255 vs 83 0)  It is shown as lower case. In C#, 

“sofia”

                             [0]         83          byte                                    
                             [1]         255        byte
                             [2]         79          byte
                             [3]         255        byte
                             [4]         70          byte
                             [5]         255        byte
                             [6]         73          byte
                             [7]         255        byte
                             [8]         65          byte
                             [9]         255        byte

"SOFIA"

                             [0]         83          byte                                    
                             [1]         0        byte
                             [2]         79          byte
                             [3]         0        byte
                             [4]         70          byte
                             [5]         0        byte
                             [6]         73          byte
                             [7]         0        byte
                             [8]         65          byte
                             [9]         0        byte

"sofia"

                             [0]         115          byte                                    
                             [1]         0        byte
                             [2]         79          byte
                             [3]         0        byte
                             [4]         70          byte
                             [5]         0        byte
                             [6]         105          byte
                             [7]         0        byte
                             [8]         97          byte
                             [9]         0        byte

One can create two different directorie/files with name as C:\sofia\, C:\sofia\ or  sofia.txt, sofia.txt.

Why does the SQL engine think they are the same while storing them with the original streams?

In order to get just the exact I want I had to convert to binary first:

SELECT * FROM #r WHERE CONVERT(VARBINARY(100), Name) = CONVERT(VARBINARY(100), N'sofia')

sofia

(1 row(s) affected)

SELECT * FROM #r WHERE CONVERT(VARBINARY(100), Name) = CONVERT(VARBINARY(100), N'sofia')

sofia

(1 row(s) affected)

But this has many side effects, such as culture and case. How can I teach TSQL Engine to know they are different without much cost?

Is there an official name of this kind of string encoding ?

like image 419
Hong Ao Avatar asked Feb 11 '23 01:02

Hong Ao


1 Answers

There are two issues going on here.

First: there is the collation issue. Collations define sorting and equality of characters. As @Kazetsukai suggested, the specific collation property to help here is width-sensitivity. However, you cannot simply add _WS to any collation name and assume that it will be a valid collation. And in fact, SQL_Latin1_General_CP1_CI_AS_WS is not a valid collation.

There is a finite set of collations that you can get via SELECT * FROM fn_helpcollations() WHERE [name] LIKE N'latin%[_]ws';. The results of that query indicate that the collation you are probably wanting is Latin1_General_CI_AS_WS. And any of the collations ending in _BIN2 would work (try to not use collations ending in _BIN as those have been deprecated, just as collations starting with SQL_ have been).

But, for some reason, even using those does not seem to work:

IF 'sofia' = 'sofia' COLLATE Latin1_General_CI_AS_WS
SELECT 'Values are the same'
ELSE
SELECT 'Values are different'

IF 'sofia' = 'sofia' COLLATE Latin1_General_BIN2
SELECT 'Values are the same'
ELSE
SELECT 'Values are different'

The results for both are "Values are the same". And that brings us to:

Second: When working with NVARCHAR1 data, you must prefix string literals with a capital N, else it implicitly converts the characters to their respective VARCHAR2 characters first (or characters convert to ? if there is no mapping defined between the Unicode Code Point and a character that exists in the Code Page specified by the collation of the field or operation).

IF N'sofia' = N'sofia' COLLATE Latin1_General_CI_AS_WS
SELECT 'Values are the same'
ELSE
SELECT 'Values are different'

IF N'sofia' = N'sofia' COLLATE Latin1_General_BIN2
SELECT 'Values are the same'
ELSE
SELECT 'Values are different'

Prefixing those literal values with N allows for the expected behavior and the results for both queries is now "Values are different".


1 The XML and N-prefixed types store data as UTF-16 Little Endian. The default handling is just the UCS-2 / Base Multilingual Plane (BMP) characters. But, if using a collation ending in _SC, then it can properly handle the full UTF-16 with Supplementary Characters.

2 The CHAR, VARCHAR, and TEXT (but don't use this last one as it is deprecated) types are 8-bit ASCII Extended with Code Pages.

like image 50
Solomon Rutzky Avatar answered Feb 12 '23 22:02

Solomon Rutzky