Insert UTF-8 data into a SQL Server 2008 database

Tags: c#, encoding

I have an issue with encoding. I want to put data from a UTF-8-encoded file into a SQL Server 2008 database. SQL Server only features UCS-2 encoding, so I decided to explicitly convert the retrieved data.

// connect to page file
_fsPage = new FileStream(mySettings.filePage, FileMode.Open, FileAccess.Read);
_streamPage = new StreamReader(_fsPage, System.Text.Encoding.UTF8);

Here's the conversion routine for the data:

private string ConvertTitle(string title)
{
  string utf8_String = Regex.Replace(Regex.Replace(title, @"\\.", _myEvaluator), @"(?<=[^\\])_", " ");
  byte[] utf8_bytes = System.Text.Encoding.UTF8.GetBytes(utf8_String);
  byte[] ucs2_bytes = System.Text.Encoding.Convert(System.Text.Encoding.UTF8, System.Text.Encoding.Unicode, utf8_bytes);
  string ucs2_String = System.Text.Encoding.Unicode.GetString(ucs2_bytes);

  return ucs2_String;
}

When I step through the code for the critical titles, the variable watch shows the correct characters for both the UTF-8 and the UCS-2 string. But in the database the result is partially wrong: some special characters are saved correctly, others are not.

  • Wrong: ń becomes an n
  • Right: É and é, for example, are inserted correctly.

Any idea where the problem might be and how to solve it?

Thanks in advance, Frank

Aaginor asked Sep 04 '09 13:09


3 Answers

SQL Server 2008 handles the conversion from UTF-8 into UCS-2 for you.

First, make sure your SQL tables use the nchar or nvarchar data types for the columns. Then you need to tell SQL Server that you're sending in Unicode data by adding an N in front of the encoded string.

INSERT INTO tblTest (test) VALUES (N'EncodedString')

From Microsoft: http://support.microsoft.com/kb/239530
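
When the insert is issued from C# rather than typed as a SQL literal, a parameter typed as NVarChar plays roughly the same role as the N prefix. The following is only a sketch: it reuses the tblTest table and test column from the statement above, while the class name, the method name, and the connection string argument are made up for illustration.

```csharp
using System.Data;
using System.Data.SqlClient;

static class TitleWriter
{
    // connectionString is a placeholder supplied by the caller (assumption).
    public static void InsertTitle(string connectionString, string title)
    {
        using (var connection = new SqlConnection(connectionString))
        using (var command = new SqlCommand(
            "INSERT INTO tblTest (test) VALUES (@test)", connection))
        {
            // Declaring the parameter as NVarChar sends the value as Unicode,
            // so no N'...' literal has to be built by hand.
            command.Parameters.Add("@test", SqlDbType.NVarChar, 4000).Value = title;

            connection.Open();
            command.ExecuteNonQuery();
        }
    }
}
```

The target column still has to be nchar or nvarchar; a Unicode parameter written into a plain varchar column is converted to the column's code page.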

See my question and solution here: How do I convert UTF-8 data from Classic asp Form post to UCS-2 for inserting into SQL Server 2008 r2?

Chris Chadwick answered Sep 20 '22 12:09


I think you have a misunderstanding of what encodings are. An encoding is used to convert a bunch of bytes into a character string. A String does not itself have an encoding associated with it.

Internally, Strings are stored in memory as UTF-16LE bytes (which is why Windows persists in confusing everyone by calling the UTF-16LE encoding just “Unicode”). But you don't need to know that — to you, they're just strings of characters.

What your function does is:

  1. Takes a string and converts it to UTF-8 bytes.
  2. Takes those UTF-8 bytes and converts them to UTF-16LE bytes. (You could have just encoded straight to UTF-16LE instead of UTF-8 in step one.)
  3. Takes those UTF-16LE bytes and converts them back to a string. This gives you the exact same String you had in the first place!

So this function is redundant; you can actually just pass a normal String to SQL Server from .NET and not worry about it.
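
To see this for yourself, here is a small sketch that runs the same three conversion steps on a test string and compares the result with the input. The class and method names are hypothetical, and the regex part of ConvertTitle is left out because it has nothing to do with encodings.

```csharp
using System;
using System.Text;

static class RoundTripDemo
{
    // The same three steps as in ConvertTitle, minus the regex replacements.
    static string RoundTrip(string input)
    {
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(input);
        byte[] utf16Bytes = Encoding.Convert(Encoding.UTF8, Encoding.Unicode, utf8Bytes);
        return Encoding.Unicode.GetString(utf16Bytes);
    }

    static void Main()
    {
        // "ń É é" written with escapes so the source file encoding does not matter.
        string original = "\u0144 \u00C9 \u00E9";

        // The string that comes back is equal to the one that went in.
        Console.WriteLine(RoundTrip(original) == original); // prints "True"
    }
}
```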

The bit with the backslashes does do something, presumably something application-specific; I don't understand what it's for. But nothing in that function will cause Windows to flatten characters like ń to n.

What /will/ cause that kind of flattening is when you try to put characters that aren't in the database's own encoding in the database. Presumably é is OK because that character is in your default encoding of cp1252 Western European, but ń is not so it gets mangled.
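
If you want to check which characters your code page can hold, something like the sketch below may help. It assumes the database default really is cp1252, as guessed above, and that code page 1252 is available to the runtime; the class and method names are made up. With a strict encoder fallback, é should be reported as representable and ń as not.

```csharp
using System;
using System.Text;

static class CodePageCheck
{
    // Returns true when every character of text has an exact cp1252 byte.
    static bool FitsInCp1252(string text)
    {
        // Assumption: code page 1252 is available, as on .NET Framework.
        // On .NET Core / .NET 5+ it has to be enabled first via
        // Encoding.RegisterProvider(CodePagesEncodingProvider.Instance).
        Encoding strictCp1252 = Encoding.GetEncoding(
            1252,
            EncoderFallback.ExceptionFallback,
            DecoderFallback.ExceptionFallback);
        try
        {
            strictCp1252.GetBytes(text);
            return true;
        }
        catch (EncoderFallbackException)
        {
            // At least one character has no cp1252 equivalent.
            return false;
        }
    }

    static void Main()
    {
        Console.WriteLine(FitsInCp1252("\u00E9")); // é
        Console.WriteLine(FitsInCp1252("\u0144")); // ń
    }
}
```

This only tests the code page on the .NET side. The actual flattening happens in SQL Server, which substitutes a similar-looking character instead of raising an error.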

SQL Server does use ‘UCS2’ (really UTF-16LE again) to store Unicode strings, but you have to tell it to, typically by using a NATIONAL CHARACTER (NCHAR/NVARCHAR) column type instead of plain CHAR.

bobince answered Sep 20 '22 12:09


We were also very confused about encoding. Here is a useful page that explains it. The answers to the following SO question also help to explain it:

In C# String/Character Encoding what is the difference between GetBytes(), GetString() and Convert()?

CraftyFella answered Sep 18 '22 12:09