Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Can SQL Server SQL_Latin1_General_CP1_CI_AS be safely converted to Latin1_General_CI_AS?

We have a legacy database with some (older) columns using "SQL_Latin1_General_CP1_CI_AS" and more recent changes have used "Latin1_General_CI_AS".

This is a pain as joins need the additional COLLATE statement to work.

I'd like to bring everything up to "Latin1_General_CI_AS". From what I can gather they are more or less identical collations and I won't lose data during this process...

Does anyone know if this is the case?

like image 317
Kram Avatar asked Jun 09 '11 17:06

Kram


People also ask

Is SQL_Latin1_General_CP1_CI_AS the same as Latin1_General_CI_AS?

The SQL_Latin1_General_CP1_CI_AS collation is a SQL collation and the rules around sorting data for unicode and non-unicode data are different. The Latin1_General_CI_AS collation is a Windows collation and the rules around sorting unicode and non-unicode data are the same.

Can we change SQL Server collation?

You can change the collation of any new objects that are created in a user database by using the COLLATE clause of the ALTER DATABASE statement. This statement does not change the collation of the columns in any existing user-defined tables. These can be changed by using the COLLATE clause of ALTER TABLE.

Which is the best collation for SQL Server?

However here are the settings we typically recommend: Set the default collation to SQL_Latin1_General_CP1_CI_AS. Ensure that you have SQL Server running in Case Insensitive mode. We use NCHAR, NVARCHAR string types so all data is unicode, so no character set is specified.

How do you solve Cannot resolve the collation conflict between SQL_Latin1_General_CP1_CI_AS and Latin1_General_CI_AS in the equal to operation?

Issue: Cannot resolve the collation conflict between “SQL_Latin1_General_CP1_CI_AS” and “Latin1_General_CI_AS” Simply apply the default collation to the fields you are comparing.


2 Answers

Here is a more complete answer:

https://www.olcot.co.uk/revised-difference-between-collation-sql_latin1_general_cp1_ci_as-and-latin1_general_ci_as/

The key difference between these collations is in how they apply character expansion rules. Certain Latin characters may be expanded into multiple characters. The SQL_xxxx collations may ignore these character expansions when working with non-unicode text, but apply them for unicode text. As a result: joins, sorts, and comparisons may return different results when using one collation versus the other.

Example:

Under Latin1_General_CI_AS these two statements return the same set of records, as ß is expanded to ss.

SELECT * FROM MyTable3 WHERE Comments = 'strasse' SELECT * FROM MyTable3 WHERE Comments = 'straße' 

When using SQL_Latin1_General_CP1_CI_AS the above statements return different records, since the ß is treated as a different character than ss.

like image 60
Zarepheth Avatar answered Sep 20 '22 22:09

Zarepheth


If you are going to change the Collation of a Database, then there is definitely stuff you should know about so that you can plan accordingly:

  • Regarding data-loss potential:

    • NVARCHAR fields are all Unicode, which is a single character set, so there can't be any data loss for these fields (this also covers XML fields which are also stored as UTF-16 Little Endian). Meta-data fields that store the object / column / index / etc names are all NVARCHAR so no need to worry about those.
    • VARCHAR fields having different Collations but the same Code Page between the differing Collations will not be a problem since the Code Page is the character set.
    • VARCHAR fields having different Collations and moving to a different Code Page (when changing Collations) can have data loss if any of the characters being used are not represented in the new Code Page. HOWEVER, this is only an issue when physically changing the Collation of a particular field (described below) and would not happen upon changing the default Collation of a database.
  • Local variables and string literals get their Collation from the Database default. Changing the database default will change the Collation used for both local variables and string literals. But changing the Database's default Collation does not change the Collation used for existing string columns in the tables in that Database. This generally should not cause any problems when comparing or concatenating a column with a literal and/or variable since the literals and variables will take on the Collation of the column due to Collation Precedence. The only potential problem would be Code Page conversions that might occur for characters of values between 128 - 255 that are not available in the Code Page used by the Collation of the column.

  • If you are expecting a predicate / comparison / sort / concatenation / etc for a column to behave differently upon changing the Database's default Collation, then you will need to explicitly change that column's Collation using the following command:

    ALTER TABLE [{table_name}]    ALTER COLUMN [{column_name}]    {same_datatype}    {same_NULL_or_NOT NULL_setting}    COLLATE {name_of_Database_default_Collation}; 

    Be sure to specify the exact same datatype and NULL / NOT NULL setting that are currently being used, else they can revert to the default if not already being the default value. After that, if there are any indexes on any of the string columns that just had their Collation changed, then you need to rebuild those indexes.

  • Changing the Database's default Collation will change the Collation of certain database-specific meta-data, such as the name field in both sys.objects, sys.columns, sys.indexes, etc. Filtering these system Views against local variables or string literals won't be a problem since the Collation will be changing on both sides. But, if you JOIN any of the local system Views to temporary tables on string fields, and the Database-level Collation between the local database and tempdb doesn't match, then you will get the "Collation mismatch" error. This is discussed below along with the remedy.

  • One difference between these two Collations is in how they sort certain characters for VARCHAR data (this does not affect NVARCHAR data). The non-EBCDIC SQL_ Collations use what is called "String Sort" for VARCHAR data, while all other Collations, and even NVARCHAR data for the non-EBCDIC SQL_ Collations, use what is called "Word Sort". The difference is that in "Word Sort", the dash - and apostrophe ' (and maybe a few other characters?) are given a very low weight and are essentially ignored unless there are no other differences in the strings. To see this behavior in action, run the following:

    DECLARE @Test TABLE (Col1 VARCHAR(10) NOT NULL); INSERT INTO @Test VALUES ('aa'); INSERT INTO @Test VALUES ('ac'); INSERT INTO @Test VALUES ('ah'); INSERT INTO @Test VALUES ('am'); INSERT INTO @Test VALUES ('aka'); INSERT INTO @Test VALUES ('akc'); INSERT INTO @Test VALUES ('ar'); INSERT INTO @Test VALUES ('a-f'); INSERT INTO @Test VALUES ('a_e'); INSERT INTO @Test VALUES ('a''kb');  SELECT * FROM @Test ORDER BY [Col1] COLLATE SQL_Latin1_General_CP1_CI_AS; -- "String Sort" puts all punctuation ahead of letters  SELECT * FROM @Test ORDER BY [Col1] COLLATE Latin1_General_100_CI_AS; -- "Word Sort" mostly ignores dash and apostrophe 

    Returns:

    String Sort ----------- a'kb a-f a_e aa ac ah aka akc am ar 

    and:

    Word Sort --------- a_e aa ac a-f ah aka a'kb akc am ar 

    While you will "lose" the "String Sort" behavior, I'm not sure that I would call that a "feature". It is a behavior that has been deemed undesirable (as evidenced by the fact that it wasn't brought forward into any of the Windows collations). However, it is a definite difference of behavior between the two collations (again, just for non-EBCDIC VARCHAR data), and you might have code and/or customer expectations based upon the "String Sort" behavior. This requires testing your code and possibly researching to see if this change in behavior might have any negative impact on users.

  • Another difference between SQL_Latin1_General_CP1_CI_AS and Latin1_General_100_CI_AS is the ability to do Expansions on VARCHAR data (NVARCHAR data can already do these for most SQL_ Collations), such as handling æ as if it were ae:

    IF ('æ' COLLATE SQL_Latin1_General_CP1_CI_AS =     'ae' COLLATE SQL_Latin1_General_CP1_CI_AS) BEGIN   PRINT 'SQL_Latin1_General_CP1_CI_AS'; END;  IF ('æ' COLLATE Latin1_General_100_CI_AS =     'ae' COLLATE Latin1_General_100_CI_AS) BEGIN   PRINT 'Latin1_General_100_CI_AS'; END; 

    Returns:

    Latin1_General_100_CI_AS 

    The only thing you are "losing" here is not being able to do these expansions. Generally speaking, this is another benefit of moving to a Windows Collation. However, just like with the "String Sort" to "Word Sort" move, the same caution applies: it is a definite difference of behavior between the two collations (again, just for VARCHAR data), and you might have code and/or customer expectations based upon not having these mappings. This requires testing your code and possibly researching to see if this change in behavior might have any negative impact on users.

    (first noted in @Zarepheth's answer and expanded on here)

  • Another difference (that is also a benefit of moving to a Windows Collation) is that filtering a VARCHAR column that is indexed on NVARCHAR literal / variable / column you will no longer invalidate the index on the VARCHAR column. This is due to the Windows Collations using the same Unicode sorting and comparison rules for both VARCHAR and NVARCHAR data. Because the sort order is the same between the two types, when the VARCHAR data gets converted into NVARCHAR (explicitly or implicitly due to datatype precedence), the order of items in the index is still valid. For more details on this behavior, please see my post: Impact on Indexes When Mixing VARCHAR and NVARCHAR Types.

  • The server-level Collation is used to set the Collation of the system databases, which includes [model]. The [model] database is used as a template to create new databases, which includes [tempdb] upon each server startup. So, if the Database's default collation does not match the instance's default Collation and you join local tables to temporary tables on string fields, then you will get the Collation-mismatch error. Fortunately there is a somewhat easy way to correct for collation differences between the database that is "current" when CREATE #TempTable is executed and [tempdb]. When creating temporary tables, declare a collation (on string columns) using the COLLATE clause and use either a specific collation (if you know that the DB will always be using that collation), or DATABASE_DEFAULT (if you don't always know the collation of the DB where this code will execute):

    CREATE TABLE #Temp (Col1 NVARCHAR(40) COLLATE DATABASE_DEFAULT); 

    This is not necessary for table variables since they get their default Collation from the "current" database. However, if you have both table variables and temporary tables and join them on string fields, then you will need to use COLLATE {specific_collation} or COLLATE DATABASE_DEFAULT as shown directly above.

  • The server-level collation also controls local variable names, CURSOR variable names, and GOTO labels. While none of these would be impacted by the specific change being dealt with in this Question, it is at least something to be aware of.

  • It is best to use the most recent version of the desired collation, if multiple versions are available. Starting in SQL Server 2005, a "90" series of collations was introduced, and SQL Server 2008 introduced a "100" series of collations. You can find these collations by using the following queries:

    SELECT * FROM sys.fn_helpcollations() WHERE [name] LIKE N'%[_]90[_]%'; -- 476  SELECT * FROM sys.fn_helpcollations() WHERE [name] LIKE N'%[_]100[_]%'; -- 2686 
  • ALSO, while the question asks about case-insensitive Collations, it should be noted that if someone else is looking to make a similar change but is using case-sensitive Collations, then another difference between SQL Server Collations and Windows Collations, for VARCHAR data only, is which case sorts first. Meaning, if you have both A and a, the SQL_ Collations will sort A before a, while the non-SQL_ Collations (and the SQL_ Collations when dealing with NVARCHAR data) will sort a before A.

For a lot more info and details on changing the Collation of a Database or of the entire Instance, please see my post:
Changing the Collation of the Instance, the Databases, and All Columns in All User Databases: What Could Possibly Go Wrong?

For more info on working with strings and collations, please visit: Collations Info

like image 45
Solomon Rutzky Avatar answered Sep 19 '22 22:09

Solomon Rutzky