Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Unicode characters causing issues in SQL Server 2005 string comparison

This query:

select *
from op.tag
where tag = 'fussball'

Returns a result which has a tag column value of "fußball". Column "tag" is defined as nvarchar(150).

While I understand they are similar words grammatically, can anyone explain and defend this behavior? I assume it is related to the same collation settings which allow you to change case sensitivity on a column/table, but who would want this behavior? A unique constraint on the column also causes failure on inserts of one value when the other exists due to a constraint violation. How do I turn this off?

Follow-up bonus point question. Explain why this query does not return any rows:

select 1 
where 'fußball' = 'fussball'

Bonus question (answer?): @ScottCher pointed out to me privately that this is due to the string literal "fussball" being treated as a varchar. This query DOES return a result:

select 1 
where 'fußball' = cast('fussball' as nvarchar)

But then again, this one does not:

select 1 
where cast('fußball' as varchar) = cast('fussball' as varchar)

I'm confused.

like image 284
TheSoftwareJedi Avatar asked Jan 23 '23 13:01

TheSoftwareJedi


2 Answers

I guess the Unicode collation set for your connection/table/database specifies that ss == ß. The latter behavior would be because it's on a faulty fast path, or maybe it does a binary comparison, or maybe you're not passing in the ß in the right encoding (I agree it's stupid).

http://unicode.org/reports/tr10/#Searching mentions that U+00DF is special-cased. Here's an insightful excerpt:

Language-sensitive searching and matching are closely related to collation. Strings that compare as equal at some strength level are those that should be matched when doing language-sensitive matching. For example, at a primary strength, "ß" would match against "ss" according to the UCA, and "aa" would match "å" in a Danish tailoring of the UCA.

like image 96
djc Avatar answered Jan 25 '23 04:01

djc


The SELECT does return a row with collation Latin1_General_CI_AS (SQL2000).

It does not with collation Latin1_General_BIN.

You can assign a table column a collation by using the COLLATE < collation > keyword after N/VARCHAR.

You can also compare strings with a specific collation using the syntax

string1 = string2 COLLATE < collation >
like image 32
devio Avatar answered Jan 25 '23 02:01

devio