Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Replacing specific Unicode characters in strings read from Excel

I am attempting to replace some undesirable characters in a string retrieved from an Excel spreadsheet. The reason being that our Oracle database is using the WE8ISO8859P1 character set, which does not define several characters that Excel "helpfully" inserts for you in text (curly quotes, em and en dashes, etc.) Since I have no control over the database or how the Excel spreadsheets are created I need to replace the characters with something else.

I retrieve the cell contents into a string thus:

string s = xlRange.get_Range("A1", Missing.Value).Value2.ToString().Trim();

Viewing the string in Visual Studio's Text Visualiser shows the text to be complete and correctly retrieved. Next I try and replace one of the undesirable characters (in this case the right-hand curly quote symbol):

s = Regex.Replace(s, "\u0094", "\u0022");

But it does nothing (Text Visualiser shows it still to be there). To try and verify that the character I want to replace is actually in there, I tried:

bool a = s.Contains("\u0094");

but it returns false. However:

bool b = s.Contains("”");

returns true.

My (somewhat lacking) understanding of strings in .NET is that they're encoded in UTF-16, whereas Excel would probably be using ANSI. So does that mean I need to change the encoding of the text as it comes out of Excel? Or am I doing something else wrong here? Any advice would be greatly appreciated. I have read and re-read all articles I can find about Unicode and encoding but am still none the wiser.

like image 810
Sid Holland Avatar asked Oct 03 '11 05:10

Sid Holland


2 Answers

Yes strings in .Net are UTF-16.

You're doing it right; perhaps your hex-math is incorrect. The character you tested for isn't "\u0094" (Not sure that's what you meant). The following worked for me:

((int)"”"[0]).ToString("X") returns "201D"

"”" == "\u201D" returns true

"\u0094" == "" (right hand side is the empty string) returns false

A lot of UTF-16 characters will seem as an empty string by the text visualizer but they can either be an undisplayable character or part of a surrogate (i.e. Some characters may need to be typed "\UXXXXXXXX" while others you can do with (four digits) "\uXXXX".). My knowledge of this domain is very limited.

References - Jon Skeet's articles on:

  • Strings
  • Unicode
like image 93
Jonno Avatar answered Nov 14 '22 12:11

Jonno


You can use NVARCHAR and NTEXT instead of VARCHAR and TEXT for the columns that need to accomodate those characters. That wayyou don't have to convert the whole database, and you are future proof, because the columns will be Unicode.

like image 38
Mihai Nita Avatar answered Nov 14 '22 13:11

Mihai Nita