We are testing our application for Unicode compatibility and have been selecting random characters outside the Latin character set for testing.
On both Latin and Japanese-collated systems the following equality is true (U+3422):
N'㐢㐢㐢㐢' = N'㐢㐢㐢'
but the following is not (U+30C1):
N'チチチチ' = N'チチチ'
This was discovered when a test case using the first example (using U+3422) violated a unique index. Do we need to be more selective about the characters we use for testing? Obviously we don't know the semantic meaning of the above comparisons. Would this behavior be obvious to a native speaker?
The "N" prefix stands for National Language in the SQL-92 standard, and is used for representing Unicode characters. In the current standard, it must be an upper case , which is what you will typically find implemented in mainstream products.
Example - Equality Operator In SQL Server, you can use the = operator to test for equality in a query. WHERE first_name = 'Jane'; In this example, the SELECT statement above would return all rows from the employees table where the first_name is equal to Jane.
In SQL, you can use the = operator to test for equality in a query. In this example, the SELECT statement above would return all rows from the suppliers table where the supplier_name is equal to Microsoft.
When SQL Server sees the N prefix it knows that the characters inside the single quotes are Unicode characters (i.e. there could be latin, chinese, japanese or a combination of these or any other characters out there).
The SQL Server not equal operators are used to test that one value, often a column, does not match the value of another. These operators can also be used in T-SQL code as a part of WHILE loops, IF statements, HAVING clauses, join predicates , SQL GROUP BY or CASE statements.
Performance: If the NAME column is of any type other than nvarchar or nchar, then you should not specify the N prefix. However, if the NAME column is of type nvarchar or nchar, then if you do not specify the N prefix, then 'JON' is treated as non-unicode.
2. Character set: If the column is of type nvarchar or nchar, then always use the prefix N while specifying the character string in the WHERE criteria/UPDATE/INSERT clause. If you do not do this and one of the characters in your string is unicode (like international characters - example - ā) then it will fail or suffer data corruption.
Michael Kaplan has a blog post where he explains how Unicode strings are compared. It all comes down to the point that a string needs to have a weight, if it doesn't it will be considered equal to the empty string.
Sorting it all Out: The jury will give this string no weight
In SQL Server this weight is influenced by the defined collation. Microsoft has added appropriate collations for CJK Unified Ideographs in Windows XP/2003 and SQL Server 2005. This post recommends to use Chinese_Simplified_Pinyin_100_CI_AS
or Chinese_Simplified_Stroke_Order_100_CI_AS
:
You can always use any binary and binary2 collations although it wouldn't give you Linguistic correct result. For SQL Server 2005, you SHOULD use Chinese_PRC_90_CI_AS or Chinese_PRC_Stoke_90_CI_AS which support surrogate pair comparison (but not linguistic). For SQL Server 2008, you should use Chinese_Simplified_Pinyin_100_CI_AS and Chinese_Simplified_Stroke_Order_100_CI_AS which have better linguistic surrogate comparison. I do suggest you use these collation as your server/database/table collation instead of passing the collation name during comparison.
So the following SQL statement would work as expected:
select * from MyTable where N'' = N'㐀' COLLATE Chinese_Simplified_Stroke_Order_100_CI_AS;
A list of all supported collations can be found in MSDN:
SQL Server 2008 Books Online: Windows Collation Name
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With