As the default SQL_Latin1_General_CP1_CI_AS
collation of SQL server can't differentiate between ss
and ß
, I want to change the collation of a specific column in a table to SQL_Latin1_General_CP437_BIN2
, as advised in here.
However, I am not sure whether this is generally a good practice or not. Also I am not sure about the implications other than the following:
I am curious about the other major implications of this change, if any.
Additionally, I would also like to know which one of the following would be a most suited for this scenario:
SQL_Latin1_General_CP437_BIN
Description: Latin1-General, binary sort for Unicode Data, SQL Server Sort Order 30 on Code Page 437 for non-Unicode Data
SQL_Latin1_General_CP437_BIN2
Description: Latin1-General, binary code point comparison sort for Unicode Data, SQL Server Sort Order 30 on Code Page 437 for non-Unicode Data
SQL_Latin1_General_CP850_BIN
Description: Latin1-General, binary sort for Unicode Data, SQL Server Sort Order 40 on Code Page 850 for non-Unicode Data
SQL_Latin1_General_CP850_BIN2
Description: Latin1-General, binary code point comparison sort for Unicode Data, SQL Server Sort Order 40 on Code Page 850 for non-Unicode Data
If you think that there are other collations better suited for this scenario, please mention those as well.
Update on 19.03.2017: To anyone coming to this question:
Have fun :)
Binary collations sort data based on the sequence of coded values that are defined by the locale and data type. They're case-sensitive. A binary collation in SQL Server defines the locale and the ANSI code page that's used. This enforces a binary sort order.
However here are the settings we typically recommend: Set the default collation to SQL_Latin1_General_CP1_CI_AS. Ensure that you have SQL Server running in Case Insensitive mode. We use NCHAR, NVARCHAR string types so all data is unicode, so no character set is specified.
To view the collation setting of a database In Object Explorer, connect to an instance of the Database Engine and on the toolbar, click New Query. In the query window, enter the following statement that uses the sys. databases system catalog view. SELECT name, collation_name FROM sys.
You can get the server collation in SQL Server Management Studio (SSMS) by right-clicking the SQL instance, then clicking the “Properties” option and checking the “General” tab. This collation is selected by default at the installation of SQL Server.
When installing SQL Server 2019, the default Collation (at least for systems with English as the OS-level language) is still “ SQL_Latin1_General_CP1_CI_AS “. Why, why, WHY????
The third and fourth columns show that given the type the required 4 bytes allows everything to work as expected. On the other hand, when attempting to store the same character, using the UTF-16 4-byte sequence, into an NVARCHAR type that is too small, the result is the default Replacement Character ( U+FFFD ).
VARCHAR is typically one byte per character but can’t represent a wide range of accented characters all at the same time (more than would be found on a single 8-bit Code Page). NVARCHAR , being a Unicode datatype, can represent all characters, but at a cost: each character is typically 2 bytes.
This means that all string operations need to start at the beginning and proceed byte-by-byte. On the other hand, NVARCHAR / UTF-16 is always 2 bytes (even Supplementary Characters are comprised of two 2-byte Code Points), so everything can be read in 2-byte chunks.
A few things about Collations:
The SQL_
Collations were deprecated as of SQL Server 2000 (yes, 2000). If you can avoid using them, you should (but that doesn't mean go changing a bunch of things if there is no pressing need to!).
The issue with the SQL_
Collations is really only related to VARCHAR
(i.e. non-Unicode) data as NVARCHAR
(i.e. Unicode) data uses the rules from the OS. But the rules for sorting and comparison for VARCHAR
data, unfortunately, use a simple mapping and do not include the more complex linguistic rules. This is why ss
and ß
do not equate when stored as VARCHAR
using the same SQL_Latin1_General_CP1_CI_AS
Collation. These deprecated Collations also are not able to give a lower weight to dashes when used in the middle of a word. The non-SQL_
Collations (i.e. Windows Collations) use the same rules for both VARCHAR
and NVARCHAR
so the VARCHAR
handling is more robust, more consistent with NVARCHAR
.
The _BIN
Collations were deprecated as of SQL Server 2005. If you can avoid using them, you should (but that doesn't mean go changing a bunch of things if there is no pressing need to!).
The issue with the _BIN
Collations is rather subtle as it only affects sorting. Comparisons are the same between _BIN
and _BIN2
Collations due to them being compared at the byte level (hence no linguistic rules). BUT, due to SQL Server (and Windows / PCs) being Little Endian, entities are stored in reverse byte order. This becomes apparent when dealing with double-byte "characters", which is what NVARCHAR
data is: UTF-16 Little Endian. This means that Unicode Code Point U+1216 has a hex/binary representation of 0x1216 on Big Endian systems, but is stored as 0x1612 on Little Endian systems. To come full circle so that the importance of this last point will (hopefully) become obvious: the _BIN
Collations will compare byte by byte (after the first character) and hence see U+1216 as being 0x16 and then 0x12, while the _BIN2
Collations will compare code point by code point and hence see U+1216 as being 0x12 and then 0x16.
This particular column is NVARCHAR
(a VARCHAR
column using SQL_Latin1_General_CP1_CI_AS
would not equate ss
and ß
) and so for just this column alone, there is no difference between SQL_Latin1_General_CP437_BIN2
and SQL_Latin1_General_CP850_BIN2
due to Unicode being a single, all-inclusive character set.
For VARCHAR
data, there would be a difference since they are different code pages (437 and 850), and both of those are different than the one that you are using now (CP1
== code page 1252).
While using a binary Collation is often overkill, in this case it might be necessary given that there is only one locale / culture that does not equate ß
with ss
: Hungarian. Using a Hungarian Collation might have some linguistic rules that you don't want (or at least wouldn't expect), so the binary Collation seems to be the better choice here (just not any of the 4 you are asking about :-). Just keep in mind that by using a binary Collation, not only are you giving up all linguistic rules, but you also lose the ability to equate different versions of the same character, such as A
(Latin Capital Letter A U+0041) and A
(Fullwidth Latin Capital Letter A U+FF21).
Use the following query to see what Collations are non-binary and do not equate these characters:
DECLARE @SQL NVARCHAR(MAX) = N'DECLARE @Counter INT = 1;';
SELECT @SQL += REPLACE(N'
IF(N''ß'' COLLATE {Name} <> N''ss'' COLLATE {Name})
BEGIN
RAISERROR(N''%4d. {Name}'', 10, 1, @Counter) WITH NOWAIT;
SET @Counter += 1;
END;
', N'{Name}', col.[name]) + NCHAR(13) + NCHAR(10)
FROM sys.fn_helpcollations() col
WHERE col.[name] NOT LIKE N'SQL[_]%'
AND col.[name] NOT LIKE N'%[_]BIN%'
ORDER BY col.[name]
--PRINT @SQL;
EXEC (@SQL);
So:
Latin1_General_100_BIN2
.If most / all of the queries that use this column need ß
with ss
to be seen as different, then go ahead and alter the column to use Latin1_General_100_BIN2
. This will likely require dropping the following dependent objects and then recreating after the ALTER TABLE
:
HINT: Be sure to check the current NULL / NOT NULL setting of the column and specify that in the ALTER TABLE ... ALTER COLUMN ...
statement so that it does not get changed.
COLLATE
clause, on a per-condition basis (e.g. WHERE tab.[ThisColumn] LIKE N'%ss%' COLLATE Latin1_General_100_BIN2
). The COLLATE
keyword should only be needed on one side (of the operator) as Collation Precedence will apply it to the other side.For more info on working with strings and collations, please visit: Collations Info
In general, BIN2
would be preferable over BIN
, and you may want to choose a windows collation over a sql collation. e.g. Latin1_General_100_BIN2
Guidelines for Using BIN and BIN2 Collations
Guidelines for Using BIN Collations
If your SQL Server applications interact with older versions of SQL Server that use binary collations, continue to use binary. Binary collations might be a more suitable choice for mixed environments.
For similar reasons to what has just been stated regarding the BIN2 collations, unless you have specific requirements to maintain backwards-compatibility behavior, you should lean towards using the Windows collations and not the SQL Server-specific collations (i.e. the ones starting with SQL are now considered kinda "sucky" ;-) ).
- @srutzky - Latin1_General_BIN performance impact when changing the database default collation
rextester demo: http://rextester.com/KIIDYH74471
create table t (
a varchar(16) --collate SQL_Latin1_General_CP1_CI_AS /* default */
, b varchar(16) --collate SQL_Latin1_General_CP1_CI_AS
, c nvarchar(16) --collate SQL_Latin1_General_CP1_CI_AS
, d nvarchar(16) --collate SQL_Latin1_General_CP1_CI_AS
);
insert into t values ('ss','ß',N'ss',N'ß');
select *
, case when a = b then '=' else '!=' end as [a=b] /* != */
, case when a = d then '=' else '!=' end as [a=d] /* = */
, case when c = b then '=' else '!=' end as [c=b] /* = */
, case when c = d then '=' else '!=' end as [c=d] /* = */
from t;
returns:
+----+---+----+---+-----+-----+-----+-----+
| a | b | c | d | a=b | a=d | c=b | c=d |
+----+---+----+---+-----+-----+-----+-----+
| ss | ß | ss | ß | != | = | = | = |
+----+---+----+---+-----+-----+-----+-----+
create table t (
a varchar(16) collate Latin1_General_100_BIN2
, b varchar(16) collate Latin1_General_100_BIN2
, c nvarchar(16) collate Latin1_General_100_BIN2
, d nvarchar(16) collate Latin1_General_100_BIN2
);
insert into t values ('ss','ß',N'ss',N'ß');
select *
, case when a = b then '=' else '!=' end as [a=b] /* != */
, case when a = d then '=' else '!=' end as [a=d] /* != */
, case when c = b then '=' else '!=' end as [c=b] /* != */
, case when c = d then '=' else '!=' end as [c=d] /* != */
from t;
returns:
+----+---+----+---+-----+-----+-----+-----+
| a | b | c | d | a=b | a=d | c=b | c=d |
+----+---+----+---+-----+-----+-----+-----+
| ss | ß | ss | ß | != | != | != | != |
+----+---+----+---+-----+-----+-----+-----+
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With