Replacing specific Unicode characters in strings read from Excel

Question

I am attempting to replace some undesirable characters in a string retrieved from an Excel spreadsheet. The reason being that our Oracle database is using the WE8ISO8859P1 character set, which does not define several characters that Excel "helpfully" inserts for you in text (curly quotes, em and en dashes, etc.) Since I have no control over the database or how the Excel spreadsheets are created I need to replace the characters with something else.

I retrieve the cell contents into a string thus:

string s = xlRange.get_Range("A1", Missing.Value).Value2.ToString().Trim();

Viewing the string in Visual Studio's Text Visualiser shows the text to be complete and correctly retrieved. Next I try and replace one of the undesirable characters (in this case the right-hand curly quote symbol):

s = Regex.Replace(s, "\u0094", "\u0022");

But it does nothing (Text Visualiser shows it still to be there). To try and verify that the character I want to replace is actually in there, I tried:

bool a = s.Contains("\u0094");

but it returns false. However:

bool b = s.Contains("”");

returns true.

My (somewhat lacking) understanding of strings in .NET is that they're encoded in UTF-16, whereas Excel would probably be using ANSI. So does that mean I need to change the encoding of the text as it comes out of Excel? Or am I doing something else wrong here? Any advice would be greatly appreciated. I have read and re-read all articles I can find about Unicode and encoding but am still none the wiser.

Jonno · Accepted Answer

Yes strings in .Net are UTF-16.

You're doing it right; perhaps your hex-math is incorrect. The character you tested for isn't "\u0094" (Not sure that's what you meant). The following worked for me:

((int)"”"[0]).ToString("X") returns "201D"

"”" == "\u201D" returns true

"\u0094" == "" (right hand side is the empty string) returns false

A lot of UTF-16 characters will seem as an empty string by the text visualizer but they can either be an undisplayable character or part of a surrogate (i.e. Some characters may need to be typed "\UXXXXXXXX" while others you can do with (four digits) "\uXXXX".). My knowledge of this domain is very limited.

References - Jon Skeet's articles on:

Strings
Unicode

Mihai Nita · Answer

You can use NVARCHAR and NTEXT instead of VARCHAR and TEXT for the columns that need to accomodate those characters. That wayyou don't have to convert the whole database, and you are future proof, because the columns will be Unicode.

Replacing specific Unicode characters in strings read from Excel

Tags:

c#

encoding

unicode

Sid Holland

2 Answers

Jonno

Mihai Nita

Recent Activity

Donate For Us

Replacing specific Unicode characters in strings read from Excel

Tags:

c#

encoding

unicode

Sid Holland

2 Answers

Jonno

Mihai Nita

Related questions

Recent Activity

Donate For Us