I've read a lot about this, but I still have some questions. (I'm not talking about case sensitivity here.)
If I have a character (ש, for example) and it is stored in an nvarchar column, which can hold anything, why would I need a collation here?
If I'm "Facebook" and I need the ability to store all characters from all languages, what is the relationship between the collation and my nvarchar columns?
Thanks in advance.
A collation allows character data for a given language to be sorted using rules that define the correct character sequence, with options for specifying case-sensitivity, accent marks, kana character types, use of symbols or punctuation, character width, and word sorting.
Use nvarchar when the sizes of the column data entries vary considerably. Use nvarchar(max) when the sizes of the column data entries vary considerably, and the string length might exceed 4,000 byte-pairs.
Collations in SQL Server provide sorting rules, case, and accent sensitivity properties for your data. Collations that are used with character data types, such as char and varchar, dictate the code page and corresponding characters that can be represented for that data type.
The key difference between varchar and nvarchar is how they are stored: varchar is stored as regular 8-bit data (1 byte per character), while nvarchar stores data at 2 bytes per character. For this reason, nvarchar(n) can hold up to 4,000 characters, and it takes twice the space of varchar for the same text.
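The storage difference can be checked directly with DATALENGTH, which returns the number of bytes used by a value. This is a minimal sketch; the column aliases are only illustrative names.

```sql
-- DATALENGTH returns bytes, not characters.
SELECT DATALENGTH('abc')  AS VarcharBytes,        -- 3: one byte per character
       DATALENGTH(N'abc') AS NVarcharBytes,       -- 6: two bytes per character
       DATALENGTH(N'ש')   AS HebrewNVarcharBytes; -- 2: a single UTF-16 code unit
```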
Storing and representing characters is one thing, and knowing how to sort and compare them is another.
Unicode data, stored in the XML and N-prefixed types in SQL Server, can represent all characters in all languages (for the most part, and that is its goal) with a single character set. So for NCHAR / NVARCHAR data (I am leaving out NTEXT as it shouldn't be used anymore, and XML as it is not affected by Collations), the Collations do not change what characters can be stored. For CHAR and VARCHAR data, the Collations do affect what can be stored as each Collation points to a particular Code Page, which determines what can be stored in values 128 - 255.
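As a rough illustration of the Code Page point, the sketch below converts the same Hebrew letter to VARCHAR under two collations. The choice of Hebrew_100_CI_AS and Latin1_General_100_CI_AS is mine, purely for the example. The conversion should keep ש under the Hebrew collation and replace it with "?" under the Latin1 one, because code page 1252 has no slot for that character.

```sql
-- Which Code Page does each collation point to?
SELECT COLLATIONPROPERTY('Hebrew_100_CI_AS', 'CodePage')         AS HebrewCodePage, -- 1255
       COLLATIONPROPERTY('Latin1_General_100_CI_AS', 'CodePage') AS Latin1CodePage; -- 1252

-- Converting NVARCHAR to VARCHAR uses the Code Page of the value's collation.
SELECT CONVERT(VARCHAR(10), N'ש' COLLATE Hebrew_100_CI_AS)         AS ViaHebrew,
       CONVERT(VARCHAR(10), N'ש' COLLATE Latin1_General_100_CI_AS) AS ViaLatin1;
```

The NVARCHAR value itself is unaffected by either collation; only the VARCHAR conversion depends on the Code Page.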
Now, while there is a default sort order for all characters, that cannot possibly work across all languages and cultures. There are many languages that share some / many / all characters, but have different rules for how to sort them. For example, the letter "C" comes before the letter "D" in most alphabets that use those letters. In US English, a combination of "C" and "H" (i.e. "CH" as two separate letters) would naturally come before any string starting with a "D". But, in a few languages, the two-letter combination of "CH" is special and sorts after "D":
IF ( N'CH' COLLATE Czech_CI_AI > N'D' COLLATE Czech_CI_AI
AND N'C' COLLATE Czech_CI_AI < N'D' COLLATE Czech_CI_AI
AND N'CI' COLLATE Czech_CI_AI < N'D' COLLATE Czech_CI_AI
) PRINT 'Czech_CI_AI';
IF ( N'CH' COLLATE Czech_100_CI_AI > N'D' COLLATE Czech_100_CI_AI
AND N'C' COLLATE Czech_100_CI_AI < N'D' COLLATE Czech_100_CI_AI
AND N'CI' COLLATE Czech_100_CI_AI < N'D' COLLATE Czech_100_CI_AI
) PRINT 'Czech_100_CI_AI';
IF ( N'CH' COLLATE Slovak_CI_AI > N'D' COLLATE Slovak_CI_AI
AND N'C' COLLATE Slovak_CI_AI < N'D' COLLATE Slovak_CI_AI
AND N'CI' COLLATE Slovak_CI_AI < N'D' COLLATE Slovak_CI_AI
) PRINT 'Slovak_CI_AI';
IF ( N'CH' COLLATE Slovak_CS_AS > N'D' COLLATE Slovak_CS_AS
AND N'C' COLLATE Slovak_CS_AS < N'D' COLLATE Slovak_CS_AS
AND N'CI' COLLATE Slovak_CS_AS < N'D' COLLATE Slovak_CS_AS
) PRINT 'Slovak_CS_AS';
IF ( N'CH' COLLATE Latin1_General_100_CI_AS > N'D' COLLATE Latin1_General_100_CI_AS
AND N'C' COLLATE Latin1_General_100_CI_AS < N'D' COLLATE Latin1_General_100_CI_AS
AND N'CI' COLLATE Latin1_General_100_CI_AS < N'D' COLLATE Latin1_General_100_CI_AS
) PRINT 'Latin1_General_100_CI_AS'
ELSE PRINT 'Nope!';
Returns:
Czech_CI_AI
Czech_100_CI_AI
Slovak_CI_AI
Slovak_CS_AS
Nope!
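The same rules drive ORDER BY, not just comparisons. Here is a small sketch using a table variable (@Words and its column are hypothetical names for this example) that sorts the same four strings under a Czech and a Latin1 collation. Going by the comparisons above, the Czech ordering should place "CH" after "D", while the Latin1 ordering should place it between "C" and "CI".

```sql
DECLARE @Words TABLE (Word NVARCHAR(10));
INSERT INTO @Words (Word) VALUES (N'C'), (N'CH'), (N'CI'), (N'D');

-- Czech rules: "CH" is treated as its own letter, sorted after "D".
SELECT Word FROM @Words ORDER BY Word COLLATE Czech_100_CI_AI;

-- Latin1 rules: "CH" is just "C" followed by "H".
SELECT Word FROM @Words ORDER BY Word COLLATE Latin1_General_100_CI_AS;
```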
To see examples of sorting rules across various cultures, please see: Collation Charts.
Also, in some languages certain letters or combinations of letters equate to other letters in ways that they do not in most other languages. For example, only in Danish does "å" equate to "aa". But "å" does not equate to just a single "a":
IF (N'aa' COLLATE Danish_Greenlandic_100_CI_AI = N'å' COLLATE Danish_Greenlandic_100_CI_AI
AND N'a' COLLATE Danish_Greenlandic_100_CI_AI <> N'å' COLLATE Danish_Greenlandic_100_CI_AI
) PRINT 'Danish_Greenlandic_100_CI_AI';
IF ( N'aa' COLLATE Danish_Norwegian_CI_AI = N'å' COLLATE Danish_Norwegian_CI_AI
AND N'a' COLLATE Danish_Norwegian_CI_AI <> N'å' COLLATE Danish_Norwegian_CI_AI
) PRINT 'Danish_Norwegian_CI_AI';
IF ( N'aa' COLLATE Latin1_General_100_CI_AI = N'å' COLLATE Latin1_General_100_CI_AI
AND N'a' COLLATE Latin1_General_100_CI_AI <> N'å' COLLATE Latin1_General_100_CI_AI
) PRINT 'Latin1_General_100_CI_AI'
ELSE PRINT 'Nope!';
Returns:
Danish_Greenlandic_100_CI_AI
Danish_Norwegian_CI_AI
Nope!
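This also shows how the collation relates to an NVARCHAR column: the column carries a collation, and that collation decides which stored values count as equal in a WHERE clause. The sketch below assumes a hypothetical @Cities table variable whose column is declared with Danish_Norwegian_CI_AI. Based on the comparisons above, the first query should match both rows and the second, which overrides the collation, only the literal "aalborg".

```sql
-- The column's collation is part of its definition.
DECLARE @Cities TABLE (Name NVARCHAR(50) COLLATE Danish_Norwegian_CI_AI);
INSERT INTO @Cities (Name) VALUES (N'aalborg'), (N'ålborg');

-- Uses the column's Danish collation: "aa" and "å" compare as equal.
SELECT Name FROM @Cities WHERE Name = N'aalborg';

-- An explicit COLLATE overrides the column's collation for this comparison.
SELECT Name FROM @Cities WHERE Name = N'aalborg' COLLATE Latin1_General_100_CI_AI;
```

Both rows are stored without any loss in either case; the collation only changes how they are compared.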
This is all highly complex, and I haven't even mentioned handling for right-to-left languages (Hebrew and Arabic), Chinese, Japanese, combining characters, etc.
If you want some deep insight into the rules, check out the Unicode Collation Algorithm (UCA). The examples above are based on examples in that documentation, though I do not believe all of the rules in the UCA have been implemented, especially since the Windows collations (collations not starting with SQL_) are based on Unicode 5.0 or 6.0, depending on which OS you are using and the version of the .NET Framework that is installed (see SortVersion for details).
So that is what the Collations do. If you want to see all of the Collations that are available, just run the following:
SELECT [name] FROM sys.fn_helpcollations() ORDER BY [name];
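That list is long, so it can help to narrow it down. As a sketch, the query below filters to one language family (the Hebrew% pattern is just an example) and adds each collation's Code Page via COLLATIONPROPERTY:

```sql
SELECT [name],
       COLLATIONPROPERTY([name], 'CodePage') AS CodePage,
       [description]
FROM   sys.fn_helpcollations()
WHERE  [name] LIKE N'Hebrew%'
ORDER BY [name];
```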