
Why does SQL Server consider N'㐢㐢㐢㐢' and N'㐢㐢㐢' to be equal?

We are testing our application for Unicode compatibility and have been selecting random characters outside the Latin character set for testing.

On both Latin and Japanese-collated systems the following equality is true (U+3422):

N'㐢㐢㐢㐢' = N'㐢㐢㐢'

but the following is not (U+30C1):

N'チチチチ' = N'チチチ'

This was discovered when a test case using the first example (using U+3422) violated a unique index. Do we need to be more selective about the characters we use for testing? Obviously we don't know the semantic meaning of the above comparisons. Would this behavior be obvious to a native speaker?

Aidan Ryan asked May 12 '10 12:05



1 Answer

Michael Kaplan has a blog post in which he explains how Unicode strings are compared. It comes down to this: a character needs to have a weight in the collation being used. If it has none, it is ignored in the comparison, so a string made up only of such characters is considered equal to the empty string.

Sorting it all Out: The jury will give this string no weight
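The two comparisons from the question can be rerun with an explicit collation to see this effect in isolation. This is only a minimal sketch: the question does not name its collation, so Latin1_General_CI_AS is assumed here as a stand-in for the "Latin" system.

```sql
-- Assumption: Latin1_General_CI_AS stands in for the unnamed "Latin" collation.
-- U+3422: the question reports that these two strings compare as equal.
SELECT CASE
         WHEN N'㐢㐢㐢㐢' = N'㐢㐢㐢' COLLATE Latin1_General_CI_AS
         THEN 'equal' ELSE 'different'
       END AS u3422_result;

-- U+30C1: the question reports that these two strings compare as different.
SELECT CASE
         WHEN N'チチチチ' = N'チチチ' COLLATE Latin1_General_CI_AS
         THEN 'equal' ELSE 'different'
       END AS u30c1_result;
```

If U+3422 carries no weight under the chosen collation, both strings in the first query reduce to the empty string, which would explain the unique index violation described in the question.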

In SQL Server this weight is determined by the collation in effect. Microsoft added appropriate collations for CJK Unified Ideographs in Windows XP/2003 and SQL Server 2005. The post quoted below recommends using Chinese_Simplified_Pinyin_100_CI_AS or Chinese_Simplified_Stroke_Order_100_CI_AS on SQL Server 2008:

You can always use any binary and binary2 collations although it wouldn't give you Linguistic correct result. For SQL Server 2005, you SHOULD use Chinese_PRC_90_CI_AS or Chinese_PRC_Stroke_90_CI_AS which support surrogate pair comparison (but not linguistic). For SQL Server 2008, you should use Chinese_Simplified_Pinyin_100_CI_AS and Chinese_Simplified_Stroke_Order_100_CI_AS which have better linguistic surrogate comparison. I do suggest you use these collation as your server/database/table collation instead of passing the collation name during comparison.

So the comparison in the following SQL statement would behave as expected, with N'㐀' no longer treated as equal to the empty string:

select * from MyTable where N'' = N'㐀' COLLATE Chinese_Simplified_Stroke_Order_100_CI_AS;
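The quoted post suggests setting the collation on the server, database, or table rather than passing it in each comparison. A rough sketch of the column-level variant follows. The table and index names (dbo.UnicodeTest, UX_UnicodeTest_Value) are hypothetical, and the collation requires SQL Server 2008 or later according to the quote.

```sql
-- Hypothetical table; names are illustrative only.
CREATE TABLE dbo.UnicodeTest (
    Id    int IDENTITY(1,1) PRIMARY KEY,
    -- Column-level collation, so comparisons and indexes on Value use it by default.
    Value nvarchar(50) COLLATE Chinese_Simplified_Stroke_Order_100_CI_AS NOT NULL
);

-- The unique index inherits the column's collation for its equality checks.
CREATE UNIQUE INDEX UX_UnicodeTest_Value ON dbo.UnicodeTest (Value);

-- The two test strings from the question.
INSERT INTO dbo.UnicodeTest (Value) VALUES (N'㐢㐢㐢㐢');
INSERT INTO dbo.UnicodeTest (Value) VALUES (N'㐢㐢㐢');
```

Provided the collation assigns a weight to U+3422, the two inserts should no longer collide on the unique index. This has not been verified here and is worth checking against the target server version.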

A list of all supported collations can be found on MSDN:

SQL Server 2008 Books Online: Windows Collation Name

Dirk Vollmar answered Nov 16 '22 03:11