Why did TSQL treat "ｓｏｆｉａ" as being the same as "sofia"? What string encoding is this?

Question

I ran into a case SQL server can store "ｓｏｆｉａ" and "sofia" as two different strings, but when compared in TSQL, they are the same whatever the COLLATE is used even if the binary Collate:

CREATE TABLE #R (NAME NvarchAR(255) COLLATE SQL_Latin1_General_CP1_CI_AS)
INSERT INTO #R VALUES (N'sofia')
INSERT INTO #r VALUES (N'ｓｏｆｉａ')

SELECT * FROM #r WHERE NAME = N'ｓｏｆｉａ'

sofia
ｓｏｆｉａ

(2 row(s) affected)

IF 'ｓｏｆｉａ' = 'sofia'  COLLATE SQL_Latin1_General_CP1_CI_AS 
SELECT 'Values are the same'
ELSE
SELECT 'Values are different'

-------------------
Values are the same

(1 row(s) affected)

IF 'ｓｏｆｉａ' = 'sofia'  COLLATE SQL_Latin1_General_CP437_BIN
SELECT 'Values are the same'
ELSE
SELECT 'Values are different'

-------------------
Values are the same

(1 row(s) affected)

I tried to find out the encode of "ｓｏｆｉａ"

http://stackoverflow.com/questions/1025332/determine-a-strings-encoding-in-c-sharp

It said:

            // If all else fails, the encoding is probably (though certainly not
            // definitely) the user's local codepage! One might present to the user a
            // list of alternative encodings as shown here: http://stackoverflow.com/questions/8509339/what-is-the-most-common-encoding-of-each-language
            // A full list can be found using Encoding.GetEncodings();

I iterate through all the encoding returned from Encoding.GetEncodings(), none of them match

Looking into the binary I found an interesting fact: “ｓｏｆｉａ” itself is encoded with UTF16, but it can be generated from  "SOFIA" UTF16 by filling “1” instead of “0” in the extra byte besides ASCII code (Ex for ‘S’: 83 255 vs 83 0)  It is shown as lower case. In C#, 

“ｓｏｆｉａ”

                             [0]         83          byte                                    
                             [1]         255        byte
                             [2]         79          byte
                             [3]         255        byte
                             [4]         70          byte
                             [5]         255        byte
                             [6]         73          byte
                             [7]         255        byte
                             [8]         65          byte
                             [9]         255        byte

"SOFIA"

                             [0]         83          byte                                    
                             [1]         0        byte
                             [2]         79          byte
                             [3]         0        byte
                             [4]         70          byte
                             [5]         0        byte
                             [6]         73          byte
                             [7]         0        byte
                             [8]         65          byte
                             [9]         0        byte

"sofia"

                             [0]         115          byte                                    
                             [1]         0        byte
                             [2]         79          byte
                             [3]         0        byte
                             [4]         70          byte
                             [5]         0        byte
                             [6]         105          byte
                             [7]         0        byte
                             [8]         97          byte
                             [9]         0        byte

One can create two different directorie/files with name as C:\ｓｏｆｉａ\, C:\sofia\ or  ｓｏｆｉａ.txt, sofia.txt.

Why does the SQL engine think they are the same while storing them with the original streams?

In order to get just the exact I want I had to convert to binary first:

SELECT * FROM #r WHERE CONVERT(VARBINARY(100), Name) = CONVERT(VARBINARY(100), N'ｓｏｆｉａ')

ｓｏｆｉａ

(1 row(s) affected)

SELECT * FROM #r WHERE CONVERT(VARBINARY(100), Name) = CONVERT(VARBINARY(100), N'sofia')

sofia

(1 row(s) affected)

But this has many side effects, such as culture and case. How can I teach TSQL Engine to know they are different without much cost?

Is there an official name of this kind of string encoding ?

Solomon Rutzky · Accepted Answer

There are two issues going on here.

First: there is the collation issue. Collations define sorting and equality of characters. As @Kazetsukai suggested, the specific collation property to help here is width-sensitivity. However, you cannot simply add _WS to any collation name and assume that it will be a valid collation. And in fact, SQL_Latin1_General_CP1_CI_AS_WS is not a valid collation.

There is a finite set of collations that you can get via SELECT * FROM fn_helpcollations() WHERE [name] LIKE N'latin%[_]ws';. The results of that query indicate that the collation you are probably wanting is Latin1_General_CI_AS_WS. And any of the collations ending in _BIN2 would work (try to not use collations ending in _BIN as those have been deprecated, just as collations starting with SQL_ have been).

But, for some reason, even using those does not seem to work:

IF 'ｓｏｆｉａ' = 'sofia' COLLATE Latin1_General_CI_AS_WS
SELECT 'Values are the same'
ELSE
SELECT 'Values are different'

IF 'ｓｏｆｉａ' = 'sofia' COLLATE Latin1_General_BIN2
SELECT 'Values are the same'
ELSE
SELECT 'Values are different'

The results for both are "Values are the same". And that brings us to:

Second: When working with NVARCHAR¹ data, you must prefix string literals with a capital N, else it implicitly converts the characters to their respective VARCHAR² characters first (or characters convert to ? if there is no mapping defined between the Unicode Code Point and a character that exists in the Code Page specified by the collation of the field or operation).

IF N'ｓｏｆｉａ' = N'sofia' COLLATE Latin1_General_CI_AS_WS
SELECT 'Values are the same'
ELSE
SELECT 'Values are different'

IF N'ｓｏｆｉａ' = N'sofia' COLLATE Latin1_General_BIN2
SELECT 'Values are the same'
ELSE
SELECT 'Values are different'

Prefixing those literal values with N allows for the expected behavior and the results for both queries is now "Values are different".

¹ The XML and N-prefixed types store data as UTF-16 Little Endian. The default handling is just the UCS-2 / Base Multilingual Plane (BMP) characters. But, if using a collation ending in _SC, then it can properly handle the full UTF-16 with Supplementary Characters.

² The CHAR, VARCHAR, and TEXT (but don't use this last one as it is deprecated) types are 8-bit ASCII Extended with Code Pages.

Why did TSQL treat "ｓｏｆｉａ" as being the same as "sofia"? What string encoding is this?

Tags:

c#

sql-server

encoding

unicode

collation

Hong Ao

1 Answers

Solomon Rutzky

Recent Activity

Donate For Us

Why did TSQL treat "ｓｏｆｉａ" as being the same as "sofia"? What string encoding is this?

Tags:

c#

sql-server

encoding

unicode

collation

Hong Ao

1 Answers

Solomon Rutzky

Related questions

Recent Activity

Donate For Us