
What is the point of COLLATIONS for nvarchar (Unicode) columns?

I've read a lot about this.

I still have some questions:

I'm not talking about case sensitivity here...

  • If I have a character (ש for example) and it is stored in an nvarchar column, which can hold anything, why would I need a collation here?

  • If I'm "Facebook" and I need the ability to store all characters from all languages, what is the relationship between the collation and my nvarchar columns?

Thanks in advance.

Royi Namir asked Mar 18 '12 07:03



1 Answer

Storing and representing characters is one thing, and knowing how to sort and compare them is another.

Unicode data, stored in the XML and N-prefixed types in SQL Server, can represent all characters in all languages (for the most part, and that is its goal) with a single character set. So for NCHAR / NVARCHAR data (I am leaving out NTEXT as it shouldn't be used anymore, and XML as it is not affected by Collations), the Collations do not change what characters can be stored. For CHAR and VARCHAR data, the Collations do affect what can be stored as each Collation points to a particular Code Page, which determines what can be stored in values 128 - 255.
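To make that distinction concrete, here is a minimal sketch (the table variable and column names are made up for illustration) that stores the ש from the question in two VARCHAR columns with different Collations and in one NVARCHAR column. The Code Page numbers in the comments are the ones I understand those Collations to use:

```sql
-- Illustrative only: @Demo and its column names are hypothetical.
DECLARE @Demo TABLE
(
    VarcharLatin  VARCHAR(10)  COLLATE Latin1_General_100_CI_AS, -- Code Page 1252 (no Hebrew letters)
    VarcharHebrew VARCHAR(10)  COLLATE Hebrew_100_CI_AS,         -- Code Page 1255 (has Hebrew letters)
    NvarcharLatin NVARCHAR(10) COLLATE Latin1_General_100_CI_AS  -- UTF-16; the Code Page is not used for storage
);

-- The same Unicode literal goes into all three columns.
INSERT INTO @Demo (VarcharLatin, VarcharHebrew, NvarcharLatin)
VALUES (N'ש', N'ש', N'ש');

SELECT VarcharLatin, VarcharHebrew, NvarcharLatin
FROM   @Demo;
```

I would expect VarcharLatin to come back as "?" (the character does not exist in that Code Page), while VarcharHebrew and NvarcharLatin both return "ש". The NVARCHAR column keeps the character even though its Collation is a Latin one.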

Now, while there is a default sort order for all characters, that cannot possibly work across all languages and cultures. There are many languages that share some / many / all characters, but have different rules for how to sort them. For example, the letter "C" comes before the letter "D" in most alphabets that use those letters. In US English, a combination of "C" and "H" (i.e. "CH" as two separate letters) would naturally come before any string starting with a "D". But, in a few languages, the two-letter combination of "CH" is special and sorts after "D":

IF (   N'CH' COLLATE Czech_CI_AI > N'D' COLLATE Czech_CI_AI
   AND N'C'  COLLATE Czech_CI_AI < N'D' COLLATE Czech_CI_AI
   AND N'CI' COLLATE Czech_CI_AI < N'D' COLLATE Czech_CI_AI
   ) PRINT 'Czech_CI_AI';

IF (   N'CH' COLLATE Czech_100_CI_AI > N'D' COLLATE Czech_100_CI_AI
   AND N'C'  COLLATE Czech_100_CI_AI < N'D' COLLATE Czech_100_CI_AI
   AND N'CI' COLLATE Czech_100_CI_AI < N'D' COLLATE Czech_100_CI_AI
   ) PRINT 'Czech_100_CI_AI';

IF (   N'CH' COLLATE Slovak_CI_AI > N'D' COLLATE Slovak_CI_AI
   AND N'C'  COLLATE Slovak_CI_AI < N'D' COLLATE Slovak_CI_AI
   AND N'CI' COLLATE Slovak_CI_AI < N'D' COLLATE Slovak_CI_AI
   ) PRINT 'Slovak_CI_AI';

IF (   N'CH' COLLATE Slovak_CS_AS > N'D' COLLATE Slovak_CS_AS
   AND N'C'  COLLATE Slovak_CS_AS < N'D' COLLATE Slovak_CS_AS
   AND N'CI' COLLATE Slovak_CS_AS < N'D' COLLATE Slovak_CS_AS
   ) PRINT 'Slovak_CS_AS';

IF (   N'CH' COLLATE Latin1_General_100_CI_AS > N'D' COLLATE Latin1_General_100_CI_AS
   AND N'C'  COLLATE Latin1_General_100_CI_AS < N'D' COLLATE Latin1_General_100_CI_AS
   AND N'CI' COLLATE Latin1_General_100_CI_AS < N'D' COLLATE Latin1_General_100_CI_AS
   ) PRINT 'Latin1_General_100_CI_AS'
ELSE PRINT 'Nope!';

Returns:

Czech_CI_AI
Czech_100_CI_AI
Slovak_CI_AI
Slovak_CS_AS
Nope!

To see examples of sorting rules across various cultures, please see: Collation Charts.
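The same rule shows up in an actual sort. This is just a sketch that reuses the strings from the comparisons above in an ORDER BY, first with a Czech Collation and then with a Latin1_General one:

```sql
-- Czech: "CH" is treated as its own letter and sorts after "D".
SELECT v.val
FROM   (VALUES (N'C'), (N'CH'), (N'CI'), (N'D')) AS v(val)
ORDER BY v.val COLLATE Czech_100_CI_AI;

-- Latin1_General: "CH" is just "C" followed by "H".
SELECT v.val
FROM   (VALUES (N'C'), (N'CH'), (N'CI'), (N'D')) AS v(val)
ORDER BY v.val COLLATE Latin1_General_100_CI_AS;
```

Given the comparison results above, the first query should return C, CI, D, CH and the second should return C, CH, CI, D. The stored data is identical in both cases; only the Collation used for the sort differs.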

Also, in some languages certain letters or combinations of letters equate to other letters in ways that they do not in most other languages. For example, in the Danish Collations (which are shared with Norwegian and Greenlandic) an "å" equates to "aa". But, the "å" does not equate to just a single "a":

IF (   N'aa' COLLATE Danish_Greenlandic_100_CI_AI =  N'å' COLLATE Danish_Greenlandic_100_CI_AI
   AND N'a'  COLLATE Danish_Greenlandic_100_CI_AI <> N'å' COLLATE Danish_Greenlandic_100_CI_AI
   ) PRINT 'Danish_Greenlandic_100_CI_AI';

IF (   N'aa' COLLATE Danish_Norwegian_CI_AI =  N'å' COLLATE Danish_Norwegian_CI_AI
   AND N'a'  COLLATE Danish_Norwegian_CI_AI <> N'å' COLLATE Danish_Norwegian_CI_AI
   ) PRINT 'Danish_Norwegian_CI_AI';

IF (   N'aa' COLLATE Latin1_General_100_CI_AI =  N'å' COLLATE Latin1_General_100_CI_AI
   AND N'a'  COLLATE Latin1_General_100_CI_AI <> N'å' COLLATE Latin1_General_100_CI_AI
   ) PRINT 'Latin1_General_100_CI_AI'
ELSE PRINT 'Nope!';

Returns:

Danish_Greenlandic_100_CI_AI
Danish_Norwegian_CI_AI
Nope!
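Those equivalences affect which rows a query matches, not just PRINT statements. Here is a rough sketch (the table variable and the lowercase city spellings are invented for the example) that runs the same search under two Collations:

```sql
-- Illustrative only: @Cities and its contents are hypothetical.
DECLARE @Cities TABLE (CityName NVARCHAR(50) COLLATE Latin1_General_100_CI_AI);

INSERT INTO @Cities (CityName)
VALUES (N'aarhus'), (N'århus'), (N'arhus');

-- The explicit COLLATE overrides the column's Collation for this comparison.
SELECT CityName
FROM   @Cities
WHERE  CityName = N'århus' COLLATE Danish_Norwegian_CI_AI;

-- No COLLATE clause: the column's Latin1_General_100_CI_AI is used.
SELECT CityName
FROM   @Cities
WHERE  CityName = N'århus';
```

Based on the rules shown above, I would expect the Danish comparison to match "aarhus" and "århus" but not "arhus", and the accent-insensitive Latin1_General comparison to match "århus" and "arhus" but not "aarhus". All three rows are stored the same way either way.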

This is all highly complex, and I haven't even mentioned handling for right-to-left languages (Hebrew and Arabic), Chinese, Japanese, combining characters, etc.

If you want some deep insight into the rules, check out the Unicode Collation Algorithm (UCA). The examples above are based on examples in that documentation, though I do not believe all of the rules in the UCA have been implemented, especially since the Windows collations (collations not starting with SQL_) are based on Unicode 5.0 or 6.0, depending on which OS you are using and the version of the .NET Framework that is installed (see SortVersion for details).

So that is what the Collations do. If you want to see all of the Collations that are available, just run the following:

SELECT [name] FROM sys.fn_helpcollations() ORDER BY [name];
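
If the full list is too long to scan, a variation like the following narrows it to one language and adds a couple of properties via the built-in COLLATIONPROPERTY function. The Hebrew filter is only an example, picked because of the character in the question:

```sql
-- List only the Hebrew Collations, with the Code Page used for VARCHAR data
-- and the Collation version reported by COLLATIONPROPERTY.
SELECT [name],
       COLLATIONPROPERTY([name], 'CodePage') AS [CodePage],
       COLLATIONPROPERTY([name], 'Version')  AS [Version],
       [description]
FROM   sys.fn_helpcollations()
WHERE  [name] LIKE N'Hebrew[_]%'
ORDER BY [name];
```

The CodePage value only matters for CHAR / VARCHAR data; for NCHAR / NVARCHAR the Collation still controls sorting and comparison.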
Solomon Rutzky answered Nov 02 '22 19:11