Choosing a binary collation that can differentiate between 'ss' and 'ß' for an nvarchar column in SQL Server

Tags: sql-server, tsql, sql-server-2016, unicode, collation

As the default SQL_Latin1_General_CP1_CI_AS collation of SQL Server can't differentiate between ss and ß, I want to change the collation of a specific column in a table to SQL_Latin1_General_CP437_BIN2, as advised here.

However, I am not sure whether this is generally a good practice or not. Also I am not sure about the implications other than the following:

- Changing the sort order: As I never sort the data on this column, it might not be a problem for me. However, if you think otherwise, please let me know.
- Changing case-insensitivity to case-sensitivity: As my application always provides text in lowercase, I think this change will also not be a problem for me. However, if you think otherwise, please let me know.

I am curious about the other major implications of this change, if any.

Additionally, I would also like to know which one of the following would be best suited for this scenario:

SQL_Latin1_General_CP437_BIN

Description: Latin1-General, binary sort for Unicode Data, SQL Server Sort Order 30 on Code Page 437 for non-Unicode Data

SQL_Latin1_General_CP437_BIN2

Description: Latin1-General, binary code point comparison sort for Unicode Data, SQL Server Sort Order 30 on Code Page 437 for non-Unicode Data

SQL_Latin1_General_CP850_BIN

Description: Latin1-General, binary sort for Unicode Data, SQL Server Sort Order 40 on Code Page 850 for non-Unicode Data

SQL_Latin1_General_CP850_BIN2

Description: Latin1-General, binary code point comparison sort for Unicode Data, SQL Server Sort Order 40 on Code Page 850 for non-Unicode Data

If you think that there are other collations better suited for this scenario, please mention those as well.

Update on 19.03.2017: To anyone coming to this question:

- Be sure to check both answers, from @srutzky and @SqlZim, as well as the resources they refer to. You don't want to rush into things in this case.
- As changing a collation is not for the faint-hearted :P, keeping a backup of the table data might come in handy.
- Also check the dependencies on the column, such as indexes and constraints; you may need to drop and recreate those, as was the case for me.
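
The dependency check mentioned in the last point can be sketched with the catalog views. This is only an illustration, not part of the original question: the table `dbo.Account` and column `UserName` are hypothetical placeholders, and the queries cover only indexes (including those behind primary key and unique constraints) and foreign keys.

```sql
-- Hypothetical names: replace dbo.Account / UserName with the real table and column.

-- Indexes that include the column (covers PRIMARY KEY and UNIQUE constraints too,
-- since both are enforced through an index).
SELECT i.[name] AS [IndexName],
       i.[type_desc],
       i.[is_primary_key],
       i.[is_unique_constraint]
FROM   sys.indexes i
INNER JOIN sys.index_columns ic
        ON ic.[object_id] = i.[object_id]
       AND ic.[index_id] = i.[index_id]
INNER JOIN sys.columns c
        ON c.[object_id] = ic.[object_id]
       AND c.[column_id] = ic.[column_id]
WHERE  i.[object_id] = OBJECT_ID(N'dbo.Account')
AND    c.[name] = N'UserName';

-- Foreign keys in which the column takes part, on either side of the relationship.
SELECT fk.[name] AS [ForeignKeyName],
       OBJECT_NAME(fkc.[parent_object_id]) AS [ReferencingTable],
       OBJECT_NAME(fkc.[referenced_object_id]) AS [ReferencedTable]
FROM   sys.foreign_key_columns fkc
INNER JOIN sys.foreign_keys fk
        ON fk.[object_id] = fkc.[constraint_object_id]
INNER JOIN sys.columns c
        ON (c.[object_id] = fkc.[parent_object_id]
            AND c.[column_id] = fkc.[parent_column_id])
        OR (c.[object_id] = fkc.[referenced_object_id]
            AND c.[column_id] = fkc.[referenced_column_id])
WHERE  c.[object_id] = OBJECT_ID(N'dbo.Account')
AND    c.[name] = N'UserName';
```

Other dependencies (for example computed columns or check constraints that reference the column) are not covered by this sketch.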

Have fun :)


asked Mar 18 '17 08:03

Sayan Pal

2 Answers

A few things about Collations:

1. The SQL_ Collations were deprecated as of SQL Server 2000 (yes, 2000). If you can avoid using them, you should (but that doesn't mean go changing a bunch of things if there is no pressing need to!).

   The issue with the SQL_ Collations is really only related to VARCHAR (i.e. non-Unicode) data as NVARCHAR (i.e. Unicode) data uses the rules from the OS. But the rules for sorting and comparison for VARCHAR data, unfortunately, use a simple mapping and do not include the more complex linguistic rules. This is why ss and ß do not equate when stored as VARCHAR using the same SQL_Latin1_General_CP1_CI_AS Collation. These deprecated Collations also are not able to give a lower weight to dashes when used in the middle of a word. The non-SQL_ Collations (i.e. Windows Collations) use the same rules for both VARCHAR and NVARCHAR so the VARCHAR handling is more robust, more consistent with NVARCHAR.

2. The _BIN Collations were deprecated as of SQL Server 2005. If you can avoid using them, you should (but that doesn't mean go changing a bunch of things if there is no pressing need to!).

   The issue with the _BIN Collations is rather subtle as it only affects sorting. Comparisons are the same between _BIN and _BIN2 Collations due to them being compared at the byte level (hence no linguistic rules). BUT, due to SQL Server (and Windows / PCs) being Little Endian, entities are stored in reverse byte order. This becomes apparent when dealing with double-byte "characters", which is what NVARCHAR data is: UTF-16 Little Endian. This means that Unicode Code Point U+1216 has a hex/binary representation of 0x1216 on Big Endian systems, but is stored as 0x1612 on Little Endian systems.

   To come full circle so that the importance of this last point will (hopefully) become obvious: the _BIN Collations will compare byte by byte (after the first character) and hence see U+1216 as being 0x16 and then 0x12, while the _BIN2 Collations will compare code point by code point and hence see U+1216 as being 0x12 and then 0x16.

3. This particular column is NVARCHAR (a VARCHAR column using SQL_Latin1_General_CP1_CI_AS would not equate ss and ß) and so for just this column alone, there is no difference between SQL_Latin1_General_CP437_BIN2 and SQL_Latin1_General_CP850_BIN2 due to Unicode being a single, all-inclusive character set.

4. For VARCHAR data, there would be a difference since they are different code pages (437 and 850), and both of those are different than the one that you are using now (CP1 == code page 1252).

5. While using a binary Collation is often overkill, in this case it might be necessary given that there is only one locale / culture that does not equate ß with ss: Hungarian. Using a Hungarian Collation might have some linguistic rules that you don't want (or at least wouldn't expect), so the binary Collation seems to be the better choice here (just not any of the 4 you are asking about :-). Just keep in mind that by using a binary Collation, not only are you giving up all linguistic rules, but you also lose the ability to equate different versions of the same character, such as A (Latin Capital Letter A U+0041) and Ａ (Fullwidth Latin Capital Letter A U+FF21).

   Use the following query to see what Collations are non-binary and do not equate these characters:
```
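-- Build one IF block per non-SQL_, non-binary Collation, then run them all.
-- Each Collation that treats N'ß' and N'ss' as different prints its name.
DECLARE @SQL NVARCHAR(MAX) = N'DECLARE @Counter INT = 1;';

SELECT @SQL += REPLACE(N'
  IF (N''ß'' COLLATE {Name} <> N''ss'' COLLATE {Name})
  BEGIN
    RAISERROR(N''%4d. {Name}'', 10, 1, @Counter) WITH NOWAIT;
    SET @Counter += 1;
  END;
', N'{Name}', col.[name]) + NCHAR(13) + NCHAR(10)
FROM   sys.fn_helpcollations() col
WHERE  col.[name] NOT LIKE N'SQL[_]%'   -- skip the SQL Server Collations
AND    col.[name] NOT LIKE N'%[_]BIN%'  -- skip binary Collations (_BIN and _BIN2)
ORDER BY col.[name];

--PRINT @SQL; -- uncomment to inspect the generated batch
EXEC (@SQL);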
```
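
The byte-order point about _BIN vs _BIN2 and the fullwidth caveat above can both be checked directly. The following is a minimal sketch rather than part of the original answer; the code points U+1216 and U+1315 and the leading letter `a` are arbitrary choices made for illustration.

```sql
-- UTF-16 Little Endian storage: U+1216 is stored low byte first.
SELECT CONVERT(VARBINARY(2), NCHAR(0x1216)) AS [StoredBytes]; -- 0x1612

-- Two strings with the same first character and a different second character.
-- _BIN compares the trailing bytes as stored (0x16 vs 0x15), while _BIN2 compares
-- code points (0x1216 vs 0x1315), so the two orderings are expected to differ.
SELECT v.[Label],
       ROW_NUMBER() OVER (ORDER BY v.[Val] COLLATE Latin1_General_100_BIN)  AS [OrderBIN],
       ROW_NUMBER() OVER (ORDER BY v.[Val] COLLATE Latin1_General_100_BIN2) AS [OrderBIN2]
FROM  (VALUES (N'a + U+1216', N'a' + NCHAR(0x1216)),
              (N'a + U+1315', N'a' + NCHAR(0x1315))) v([Label], [Val]);

-- A width-insensitive Collation equates A (U+0041) with fullwidth Ａ (U+FF21);
-- a binary Collation does not.
SELECT CASE WHEN N'A' = N'Ａ' COLLATE Latin1_General_100_CI_AS
            THEN '=' ELSE '!=' END AS [WidthInsensitive], -- =
       CASE WHEN N'A' = N'Ａ' COLLATE Latin1_General_100_BIN2
            THEN '=' ELSE '!=' END AS [Binary];           -- !=
```

Equality comparisons come out the same under both binary flavors; only the ordering in the second query should change.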

So:

- If you are going to use a binary Collation, use something like Latin1_General_100_BIN2.
- You do not need to change the Collation of the entire DB and all of its tables. That is a lot of work, and the only "built-in" mechanism to do it is undocumented (i.e. unsupported).
- If you were to change the Database's default Collation, that affects name resolution of Database-scoped items such as tables, columns, indexes, functions, stored procedures, etc. Meaning: you would need to regression-test 100% of the application that touches the database, as well as all SQL Server Agent jobs, etc. that touch this database.
- If most / all of the queries that use this column need ß and ss to be seen as different, then go ahead and alter the column to use Latin1_General_100_BIN2. This will likely require dropping the following dependent objects and then recreating them after the ALTER TABLE:
  - Indexes
  - Unique Constraints
  - Foreign Key Constraints

  HINT: Be sure to check the current NULL / NOT NULL setting of the column and specify that in the ALTER TABLE ... ALTER COLUMN ... statement so that it does not get changed.
- If only some queries need this different behavior, then override just those comparison operations with the COLLATE clause, on a per-condition basis (e.g. WHERE tab.[ThisColumn] LIKE N'%ss%' COLLATE Latin1_General_100_BIN2). The COLLATE keyword should only be needed on one side (of the operator) as Collation Precedence will apply it to the other side.
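
The last two options can be sketched as follows. Everything named here is hypothetical and chosen only for illustration: a table `dbo.Account` with a column `UserName NVARCHAR(50) NOT NULL` and a single nonclustered index `IX_Account_UserName` on it. Real schemas will likely have more dependent objects to handle.

```sql
-- Hypothetical objects: dbo.Account, column UserName NVARCHAR(50) NOT NULL,
-- and a nonclustered index IX_Account_UserName on that column.

-- Option 1: change the column's Collation.
-- Dependent indexes have to be dropped first and recreated afterwards.
DROP INDEX [IX_Account_UserName] ON dbo.[Account];

ALTER TABLE dbo.[Account]
    ALTER COLUMN [UserName] NVARCHAR(50)
    COLLATE Latin1_General_100_BIN2 NOT NULL; -- restate the existing type and nullability

CREATE NONCLUSTERED INDEX [IX_Account_UserName]
    ON dbo.[Account] ([UserName]);

-- Option 2: leave the column as it is and override a single predicate.
-- COLLATE on one side is enough; Collation Precedence applies it to the comparison.
SELECT acct.[UserName]
FROM   dbo.[Account] acct
WHERE  acct.[UserName] = N'strasse' COLLATE Latin1_General_100_BIN2;
```

One trade-off to keep in mind with the second option: an index built under the column's original Collation may not be usable for a seek when the predicate is evaluated under a different Collation, so it suits occasional queries better than hot paths.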

For more info on working with strings and collations, please visit: Collations Info


answered Sep 19 '22 19:09

Solomon Rutzky

In general, BIN2 is preferable to BIN, and you may want to choose a Windows collation over a SQL Server collation, e.g. Latin1_General_100_BIN2.

Guidelines for Using BIN and BIN2 Collations

Guidelines for Using BIN Collations

If your SQL Server applications interact with older versions of SQL Server that use binary collations, continue to use binary. Binary collations might be a more suitable choice for mixed environments.

For similar reasons to what has just been stated regarding the BIN2 collations, unless you have specific requirements to maintain backwards-compatibility behavior, you should lean towards using the Windows collations and not the SQL Server-specific collations (i.e. the ones starting with SQL are now considered kinda "sucky" ;-) ).
- @srutzky - Latin1_General_BIN performance impact when changing the database default collation

rextester demo: http://rextester.com/KIIDYH74471

```
create table t (
    a varchar(16)  --collate SQL_Latin1_General_CP1_CI_AS /* default */
  , b varchar(16)  --collate SQL_Latin1_General_CP1_CI_AS
  , c nvarchar(16) --collate SQL_Latin1_General_CP1_CI_AS
  , d nvarchar(16) --collate SQL_Latin1_General_CP1_CI_AS
);
insert into t values ('ss','ß',N'ss',N'ß');
select *
    , case when a = b then '=' else '!=' end as [a=b] /* != */
    , case when a = d then '=' else '!=' end as [a=d] /* = */
    , case when c = b then '=' else '!=' end as [c=b] /* = */
    , case when c = d then '=' else '!=' end as [c=d] /* = */
from t;
```

returns:

```
+----+---+----+---+-----+-----+-----+-----+
| a  | b | c  | d | a=b | a=d | c=b | c=d |
+----+---+----+---+-----+-----+-----+-----+
| ss | ß | ss | ß | !=  | =   | =   | =   |
+----+---+----+---+-----+-----+-----+-----+
```

```
create table t (
    a varchar(16)  collate Latin1_General_100_BIN2
  , b varchar(16)  collate Latin1_General_100_BIN2
  , c nvarchar(16) collate Latin1_General_100_BIN2
  , d nvarchar(16) collate Latin1_General_100_BIN2
);
insert into t values ('ss','ß',N'ss',N'ß');
select *
    , case when a = b then '=' else '!=' end as [a=b] /* != */
    , case when a = d then '=' else '!=' end as [a=d] /* != */
    , case when c = b then '=' else '!=' end as [c=b] /* != */
    , case when c = d then '=' else '!=' end as [c=d] /* != */
from t;
```

returns:

```
+----+---+----+---+-----+-----+-----+-----+
| a  | b | c  | d | a=b | a=d | c=b | c=d |
+----+---+----+---+-----+-----+-----+-----+
| ss | ß | ss | ß | !=  | !=  | !=  | !=  |
+----+---+----+---+-----+-----+-----+-----+
```

answered Sep 16 '22 19:09

SqlZim
