
SQL Server Unicode queries with SC Collation

In SQL Server 2012 I have a table with an nvarchar column that uses the collation Latin1_General_100_CI_AS_SC, which is supposed to support Unicode supplementary characters (surrogate pairs).

  1. When I run this query:

    select KeyValue from terms where KeyValue = N'➰'
    

    (The character above is a curly loop, code 10160, U+27B0.)
    The query returns hundreds of different-looking single-character entries, even though they all have different UTF-16 code points. Is this due to collation? Why isn't there an exact match?

    EDIT: I now think this is due to collation. There seems to be a group of more than 1733 "undefined" characters in the 16-bit range (codes up to 65535) that this collation treats as equal. Characters with codes above 65535, however, are treated as unique, and those queries return exact matches.

  2. The two queries below have different results:

    select KeyValue from terms where KeyValue = N'π'
    

    returns 3 rows: π and ℼ and ᴨ

    select KeyValue from terms where KeyValue LIKE N'π'
    

    returns 2 rows: π and ℼ

    Why is this?

  3. This is the weirdest of all. This query:

    select KeyValue from terms where KeyValue like N'➰%'
    

    returns ALMOST ALL records in the table, which contains many multi-character terms in the regular Latin character set, such as "8w" or "apple". About 90% of the records that are not returned start with "æ". What is happening?

NOTE: Just to give this a bit of context, these are all Wikipedia article titles, not random strings.

Rahmi Aksu asked Jul 03 '15 20:07



1 Answer

The SQL Server instance, and therefore tempdb, has its own collation, which may differ from the collation of a database or of a column. Character literals should take on the default collation of the column or database, but the (perhaps overly simplified) T-SQL examples above may be hiding the real problem; for example, an ORDER BY clause could have been omitted for the sake of simplicity. Are the expected results returned when the statements above explicitly use the COLLATE clause (https://msdn.microsoft.com/en-us/library/ms184391.aspx), i.e. COLLATE Latin1_General_100_CI_AS_SC?
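As a rough sketch of that check, the statements from the question can be rerun with the collation stated explicitly. The table and column names (terms, KeyValue) come from the question. The third statement is an extra diagnostic that is not part of the suggestion above: it assumes that a binary collation such as Latin1_General_100_BIN2 is acceptable for testing, and it uses UNICODE() to show the code point of the first character of each match.

```sql
-- Sketch only: terms and KeyValue are the names used in the question.

-- 1. Equality test with the column's collation stated explicitly.
SELECT KeyValue
FROM terms
WHERE KeyValue COLLATE Latin1_General_100_CI_AS_SC = N'➰';

-- 2. The LIKE test from the question, again with an explicit collation.
SELECT KeyValue
FROM terms
WHERE KeyValue COLLATE Latin1_General_100_CI_AS_SC LIKE N'➰%';

-- 3. Assumption (not from the answer): a binary collation compares the
--    stored values directly instead of applying linguistic rules, so it
--    can show whether the unexpected matches come from the collation.
SELECT KeyValue,
       UNICODE(KeyValue) AS FirstCodePoint  -- code point of the first character
FROM terms
WHERE KeyValue COLLATE Latin1_General_100_BIN2 = N'➰';
```

If the first two statements return the same rows as the original queries while the third returns only the exact match, the behaviour most likely comes from the collation's comparison rules rather than from a mismatch between server, database, and column collations.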

Bill answered Sep 19 '22 05:09
