How does one set the default character set for fields when creating tables in SQL Server? In MySQL one does this:
CREATE TABLE tableName (
name VARCHAR(128) CHARACTER SET utf8
) DEFAULT CHARACTER SET utf8 DEFAULT COLLATE utf8_general_ci;
Note that I set the character set twice here; that is redundant, and I included both forms just to demonstrate.
I also set the collation, to show that collation is something different. I am not asking about setting the collation: most questions about character sets and encodings in SQL Server are answered with collation, which is not the same thing.
The specific character set / encoding is based on the Code Page, which in turn is based on the Collation of a column, or the Collation of the current database for literals and variables, or the Collation of the Instance for variable / cursor names and GOTO labels, or what is specified in a COLLATE clause if one is ...
The SQL_Latin1_General_CP1_CI_AS collation is a SQL collation; its rules for sorting Unicode data differ from its rules for sorting non-Unicode data. The Latin1_General_CI_AS collation is a Windows collation; its rules for sorting Unicode and non-Unicode data are the same.
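As a sketch of the SQL Server analogue to the MySQL DDL in the question: there is no CHARACTER SET clause; the character set of a CHAR / VARCHAR column follows from the code page of its collation, so you choose an encoding by choosing a collation. The table and column names here are just illustrative:

```sql
-- Each CHAR/VARCHAR column's code page (and thus its character set)
-- comes from the collation assigned to it.
CREATE TABLE dbo.tableName
(
    name VARCHAR(128) COLLATE Latin1_General_CI_AS,   -- code page 1252
    city VARCHAR(128) COLLATE Cyrillic_General_CI_AS  -- code page 1251
);

-- Inspect which code page a given collation implies:
SELECT COLLATIONPROPERTY('Latin1_General_CI_AS',   'CodePage') AS [Latin1],
       COLLATIONPROPERTY('Cyrillic_General_CI_AS', 'CodePage') AS [Cyrillic];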
As stated in BOL (Books Online):
Each SQL Server collation specifies three properties:
- The sort order to use for Unicode data types (nchar, nvarchar, and ntext). A sort order defines the sequence in which characters are sorted, and the way characters are evaluated in comparison operations.
- The sort order to use for non-Unicode character data types (char, varchar, and text).
- The code page used to store non-Unicode character data.
The quote above is from the SQL Server 2000 docs. See also this 2008 link. The code below demonstrates this as well.
DECLARE @T TABLE
(
    code TINYINT PRIMARY KEY,
    Arabic_CS_AS           CHAR(1) COLLATE Arabic_CS_AS NULL,
    Cyrillic_General_CS_AS CHAR(1) COLLATE Cyrillic_General_CS_AS NULL,
    Latin1_General_CS_AS   CHAR(1) COLLATE Latin1_General_CS_AS NULL
);

INSERT INTO @T (code)
VALUES (200), (201), (202), (203), (204), (205);

/* Store the same raw byte in all three columns; each column's
   collation (and therefore its code page) decides which character
   that byte represents. */
UPDATE @T
SET Arabic_CS_AS           = CAST(code AS BINARY(1)),
    Cyrillic_General_CS_AS = CAST(code AS BINARY(1)),
    Latin1_General_CS_AS   = CAST(code AS BINARY(1));

SELECT *
FROM @T;
Results
code Arabic_CS_AS Cyrillic_General_CS_AS Latin1_General_CS_AS
---- ------------ ---------------------- --------------------
200 ب И È
201 ة Й É
202 ت К Ê
203 ث Л Ë
204 ج М Ì
205 ح Н Í
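The mapping above can be checked directly: each of the three collations implies a different Windows code page, which is why the same byte renders as three different characters. The values in the comments are what COLLATIONPROPERTY should return, assuming a reasonably recent SQL Server version:

```sql
SELECT COLLATIONPROPERTY('Arabic_CS_AS',           'CodePage') AS [Arabic],   -- 1256
       COLLATIONPROPERTY('Cyrillic_General_CS_AS', 'CodePage') AS [Cyrillic], -- 1251
       COLLATIONPROPERTY('Latin1_General_CS_AS',   'CodePage') AS [Latin1];   -- 1252
```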
To expand on @Martin's answer:
How you set a "character set" in SQL Server depends on the datatype that you are using.

NVARCHAR, NCHAR, and NTEXT (NTEXT is deprecated and shouldn't be used as of SQL Server 2005) all use the Unicode character set, and this cannot be changed. These datatypes are all encoded as UTF-16 LE (Little Endian) – a 16-bit encoding with each "character" being either 2 or 4 bytes – and this too cannot be changed. For these datatypes, the Collation being used only affects the locale (as determined by the LCID of the Collation), which determines the set of rules used for sorting and comparison.
XML, like the N-prefixed types, uses the Unicode character set and is encoded as UTF-16 LE (Little Endian), and neither of those can be changed. But unlike the other string datatypes, there is no Collation associated with XML data, as it cannot be sorted or compared (at least not without first converting it to NVARCHAR(MAX) [preferred] or VARCHAR(MAX)).
VARCHAR, CHAR, and TEXT (TEXT is deprecated and shouldn't be used as of SQL Server 2005) are all 8-bit encodings with each "character" being either 1 or 2 bytes. The character set is determined by the Code Page associated with each Collation. The sorting and comparison rules depend on the type of Collation being used:
- SQL Server Collations: names starting with SQL_. These have been deprecated since SQL Server 2000, though they are (unfortunately) still in wide use today. They use simple rules, indicated by the "SQL Server Sort Order" number found in the description field returned by sys.fn_helpcollations().
- Windows Collations: names not starting with SQL_. These Collations allow the non-Unicode string data to use the Unicode sorting and comparison rules indicated by the LCID of the Collation.
That being said, to find out which character set (for CHAR, VARCHAR, and TEXT – i.e. non-Unicode – data) is being used, run the following query and pay close attention to the CodePage field. The LCID field indicates the locale used for sorting and comparison rules for the N-prefixed – i.e. Unicode – types, as well as for the non-Unicode types if using a Windows Collation:
SELECT *,
COLLATIONPROPERTY(col.[name], 'CodePage') AS [CodePage],
COLLATIONPROPERTY(col.[name], 'LCID') AS [LCID]
FROM sys.fn_helpcollations() col
ORDER BY col.[name];
The Code Page IDs can be translated into something more meaningful via the MSDN page for Code Page Identifiers.
Regarding the O.P.'s comment on @Martin's answer:
It is unfortunate that they chose the misleading/incomplete term "collation" which clearly refers to sort order: collate definition.
While it is true that Microsoft could have done better when choosing a name, there is unfortunately a general, industry-wide confusion over terms such as "encoding", "character set", "collation", etc. Microsoft's use (or misuse) of "Collation" has merely contributed to the mass confusion. But, that confusion is also evident in MySQL as shown in this question, given that "utf8" is specifically not a character set ;-).
UTF-8 is one of several encodings for the Unicode character set. UTF-16 and UTF-32 are the other two encodings. All three of those encodings represent the exact same Unicode character set, just in different ways. Looking at the list of MySQL character sets – 11.1.10 Supported Character Sets and Collations – the "ucs2", "utf8", "utf8mb4", "utf16", "utf16le", "utf32" charsets are not actually character sets, per se, but various representations of the Unicode character set. But, given the overlap between the concepts of "character set" and "encoding", it would be difficult to not have this confusion. The 11.1.10.1 Unicode Character Sets page indicates that the "utf8mb4", "utf16", "utf16le", and "utf32" charsets are the full Unicode character sets while "ucs2" and "utf8" are subsets of the Unicode character set, specifically the first 65,536 code points (a.k.a. Basic Multilingual Plane (BMP)).
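The "subset" point can be sketched on the MySQL side (table name illustrative; exact behavior depends on the server's sql_mode – under strict mode the failing insert raises error 1366, "Incorrect string value"):

```sql
-- MySQL: 'utf8' (a.k.a. utf8mb3) stores at most 3 bytes per character,
-- so it can only hold BMP code points; utf8mb4 holds the full range.
CREATE TABLE charset_demo (
    bmp_only VARCHAR(10) CHARACTER SET utf8mb3,
    full_uni VARCHAR(10) CHARACTER SET utf8mb4
);

INSERT INTO charset_demo (full_uni) VALUES ('😀');  -- OK: 4-byte character
INSERT INTO charset_demo (bmp_only) VALUES ('😀');  -- fails under strict mode
```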
For more info regarding Collation across various RDBMS's, please see my answer to the following question on DBA.StackExchange:
Does any DBMS have a collation that is both case-sensitive and accent-insensitive?
UPDATE 2018-10-02
While this is not a viable option yet, SQL Server 2019 introduces native support for UTF-8 in the VARCHAR / CHAR datatypes. There are currently too many bugs with it for it to be used, but if they are fixed, then this is an option for some scenarios. Please see my post, "Native UTF-8 Support in SQL Server 2019: Savior or False Prophet?", for a detailed analysis of this new feature.
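As a sketch of what that feature looks like (the table name is illustrative; the collation is one of the real _UTF8 collations added in SQL Server 2019):

```sql
-- SQL Server 2019+: a VARCHAR column whose code page is UTF-8,
-- selected via a collation whose name ends in _UTF8.
CREATE TABLE dbo.Utf8Demo
(
    name VARCHAR(128) COLLATE Latin1_General_100_CI_AS_SC_UTF8
);
```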