
Should NVARCHAR be used to save 'accented characters' into SQL Server?

I have the following two fields in a Sql Server table:

When I add some test data with accented characters into the field, it actually stores them! I thought I had to change the column from VARCHAR to NVARCHAR to accept accented characters, etc?

Basically, I thought:

  • VARCHAR = ASCII
  • NVARCHAR = Unicode

So is this a case where façade etc are actually ASCII .. while some other characters would error (if VARCHAR)?

I can see the ç and é characters in the extended ASCII chart .. so does this mean ASCII includes 0->127 or 0->255?

(Side thought: I guess I'm happy with accepting 0->255 and stripping out anything else.)

Edit

  • DB collation: Latin1_General_CI_AS
  • Server Version: 12.0.5223.6
  • Server Collation: SQL_Latin1_General_CP1_CI_AS
asked Sep 05 '19 by Pure.Krome


1 Answer

First the details of what Sql Server is doing.

VARCHAR stores single-byte characters using a specific collation. ASCII only uses 7 bits, or half of the possible values in a byte. A collation references a specific code page (along with sorting and comparison rules) to use the other half of the possible values in each byte. These code pages often include support for a limited, specific set of accented characters. If the code page used by your collation supports a given accented character, you can store it; if it doesn't, you get weird results (unprintable "box" or ? characters). You can even output data stored in one collation as if it had been stored in another, and get really weird stuff that way (but don't do this).
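The code-page behavior above can be illustrated outside SQL Server. A quick Python sketch, using Windows code page 1252 (the "Latin1" code page that Latin1_General collations map VARCHAR data to):

```python
# ASCII proper covers only code points 0-127; 'ç' is not in it.
text = "façade"
try:
    text.encode("ascii")
except UnicodeEncodeError:
    print("not representable in 7-bit ASCII")

# Code page 1252 assigns é and ç to the upper half (128-255), one byte each,
# which is why the asker's VARCHAR column stores them just fine.
raw = text.encode("cp1252")
print(len(raw))  # 6 bytes: one byte per character

# Decoding those bytes with a *different* code page is the "really weird
# stuff" from mixing collations: the same byte 0xE7 means 'ç' in cp1252
# but 'з' in the Cyrillic code page cp1251.
print(raw.decode("cp1251"))  # faзade
```

The same single byte is simply reinterpreted through whichever code page the collation names, which is also why this only works for the specific accented characters that one code page happens to include.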

NVARCHAR is Unicode, but there is still some reliance on collations. In most situations you will end up with UTF-16, which allows the full range of Unicode characters. Certain collations will instead behave as UCS-2, which is slightly more limited. See the nchar/nvarchar documentation for more information.
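The UTF-16 vs. UCS-2 distinction comes down to surrogate pairs. A small sketch (Python here, but the byte counts are properties of the encodings themselves):

```python
# UTF-16 stores characters in the Basic Multilingual Plane (like é) as a
# single 2-byte code unit.
print(len("é".encode("utf-16-le")))   # 2

# Supplementary characters (like this emoji, U+1F600) need a surrogate
# pair: two 2-byte code units, 4 bytes total.
print(len("😀".encode("utf-16-le")))  # 4

# UCS-2 is effectively UTF-16 without surrogate-pair awareness: it only
# understands the BMP (U+0000..U+FFFF). A collation with UCS-2 semantics
# still stores the 4 bytes, but treats them as two separate "characters"
# for length, sorting, and comparison purposes.
```

That is the "slightly more limited" behavior mentioned above; in SQL Server terms, collations with supplementary-character support handle the pair as one character.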

As an additional quirk, the upcoming Sql Server 2019 will include support for UTF-8 in char and varchar types when using the correct collation.
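UTF-8 is attractive there because it is variable-width, so mostly-ASCII data stays compact. A sketch of the byte counts involved (Python; the specific characters are just illustrative):

```python
# UTF-8 keeps ASCII at 1 byte per character, while still reaching the
# full Unicode range; UTF-16 (nvarchar) costs 2 bytes for each of the
# first three and 4 for the emoji.
for ch in ("a", "é", "€", "😀"):
    print(ch, len(ch.encode("utf-8")))
# a: 1 byte, é: 2 bytes, €: 3 bytes, 😀: 4 bytes
```

So for columns that are mostly ASCII, a UTF-8 varchar can roughly halve storage compared to nvarchar while remaining fully Unicode-capable.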


Now to answer the question.

In some rare cases, where you are sure your data only needs to support accented characters originating from a single, specific (usually local) culture, and only those accented characters, you can get by with the varchar type.

But be very careful making this determination. In an increasingly global and diverse world, where even small businesses want to take advantage of the internet to increase their reach, even within their own community, using an insufficient encoding can easily result in bugs and even security vulnerabilities. The majority of situations where it seems like a varchar encoding might be good enough are really not safe anymore.

Personally, about the only place I use varchar today is mnemonic code strings that are never shown to or provided by an end user; things that might be enum values in procedural code. Even then, this tends to be legacy code, and given the option I'll use integer values instead, for faster joins and more efficient memory use. However, the upcoming UTF-8 support may change this.

answered Sep 18 '22 by Joel Coehoorn