C#: Convert Japanese text encoded in Shift-JIS and stored as ASCII into UTF-8

Question

I am trying to convert an old application that has some strings stored in the database as ASCII.

For example, the string: ƒ`ƒƒƒlƒ‹ƒp[ƒgƒi[‚Ì‘I‘ð is stored in the database.

Now, if I copy that string into a text editor, save it as ASCII, open the file in a web browser, and set the browser to detect the encoding automatically, I get the correct string in Japanese: チャネルパートナーの選択. The page reports the detected encoding as Japanese (Shift_JIS).

When I try to do the conversion in C# with something like this:

var asciiBytes = Encoding.ASCII.GetBytes(text);
var japaneseEncoding = Encoding.GetEncoding(932);
var convertedBytes = Encoding.Convert(japaneseEncoding, Encoding.ASCII, asciiBytes);
var japaneseString = japaneseEncoding.GetString(convertedBytes);

I get ?`???l???p?[?g?i?[???I?? as the Japanese string, so I cannot show it on the web page.

Any light would be appreciated.

Thanks

Hans Passant · Accepted Answer

some strings stored in the database as ASCII

It isn't ASCII; almost none of the characters in ƒ`ƒƒƒlƒ‹ƒp[ƒgƒi[‚Ì‘I‘ð are ASCII. Encoding.ASCII.GetBytes(text) replaces every character it cannot represent with a question mark, which is why you got all those question marks.

The core issue is that the bytes in the database column were read with the wrong encoding. They were decoded as code page 1252, so encoding the string back with code page 1252 recovers the original bytes:
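
To see where the question marks come from, here is a minimal sketch (the class and method names are made up for illustration). Encoding.ASCII can only represent U+0000 to U+007F, and with its default settings it substitutes '?' (0x3F) for anything else, so the original byte values are already lost at this step:

```csharp
using System;
using System.Text;

public static class AsciiFallbackDemo
{
    // Encodes text as ASCII; any character outside U+0000..U+007F becomes '?' (0x3F).
    public static byte[] ToAscii(string text)
    {
        return Encoding.ASCII.GetBytes(text);
    }

    public static void Main()
    {
        // "\u0192" is 'ƒ' (not ASCII); '`' is ASCII 0x60.
        byte[] bytes = ToAscii("\u0192`");
        Console.WriteLine(BitConverter.ToString(bytes)); // prints 3F-60
    }
}
```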

var badstringFromDatabase = "ƒ`ƒƒƒlƒ‹ƒp[ƒgƒi[‚Ì‘I‘ð";
var hopefullyRecovered = Encoding.GetEncoding(1252).GetBytes(badstringFromDatabase);
var oughtToBeJapanese = Encoding.GetEncoding(932).GetString(hopefullyRecovered);

Which produces "チャネルパートナーの選択"

This is not going to be completely reliable: code page 1252 leaves a few byte values unassigned (0x81, 0x8D, 0x8F, 0x90 and 0x9D) that code page 932 uses. Where one of those bytes occurs, you can end up with a garbled string from which the original byte value can no longer be recovered. You'll need to focus on getting the data provider to use the correct encoding.
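
The round trip can be checked without touching the database. The following is only a sketch under a few assumptions: it targets .NET Core / .NET 5+, where code pages 932 and 1252 must be registered through CodePagesEncodingProvider (on .NET Framework they are available without that call), and the names MojibakeRecovery, Misdecode and Recover are hypothetical. It deliberately uses a short string whose Shift-JIS bytes are all assigned in 1252; strings containing the unassigned bytes may or may not survive, depending on how the data was decoded in the first place.

```csharp
using System;
using System.Text;

public static class MojibakeRecovery
{
    static MojibakeRecovery()
    {
        // Assumption: running on .NET Core / .NET 5+, where legacy code pages
        // are not available until this provider is registered.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
    }

    // Simulates the original mistake: Shift-JIS bytes decoded as code page 1252.
    public static string Misdecode(string japanese)
    {
        byte[] shiftJisBytes = Encoding.GetEncoding(932).GetBytes(japanese);
        return Encoding.GetEncoding(1252).GetString(shiftJisBytes);
    }

    // Reverses the mistake: re-encode as 1252 to get the raw bytes back,
    // then decode those bytes as Shift-JIS.
    public static string Recover(string misdecoded)
    {
        byte[] rawBytes = Encoding.GetEncoding(1252).GetBytes(misdecoded);
        return Encoding.GetEncoding(932).GetString(rawBytes);
    }

    public static void Main()
    {
        string original = "チャネル";
        string garbled = Misdecode(original);
        string recovered = Recover(garbled);
        Console.WriteLine(recovered == original ? "round trip OK" : "round trip lost data");
    }
}
```

Running the same comparison over a sample of real rows is a cheap way to find out how many of them cannot be recovered this way.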

Matt Mitchell · Answer

As per the other answer, I'm pretty sure you're using the ANSI/Default encoding, not ASCII.

The following examples seem to get you what you're after.

var japaneseEncoding = Encoding.GetEncoding(932);

// From file bytes
var fileBytes = File.ReadAllBytes(@"C:\temp\test.html");
var japaneseTextFromFile = japaneseEncoding.GetString(fileBytes);
japaneseTextFromFile.Dump();

// From string bytes
var textString = "ƒ`ƒƒƒlƒ‹ƒp[ƒgƒi[‚Ì‘I‘ð";
var textBytes = Encoding.Default.GetBytes(textString);
var japaneseTextFromString = japaneseEncoding.GetString(textBytes);
japaneseTextFromString.Dump();

Interestingly, I think I need to read up on Encoding.Convert, as it did not produce the behaviour I expected. The GetString methods seem to work only if I pass in bytes obtained with Encoding.Default; if I convert to the Japanese encoding beforehand, they do not work as expected.
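
For reference, Encoding.Convert(srcEncoding, dstEncoding, bytes) decodes the bytes using srcEncoding and re-encodes the result in dstEncoding, so the source encoding has to be the one the bytes are really in. In the question's snippet the destination was Encoding.ASCII, which cannot hold Japanese text. Here is a minimal sketch of going from Shift-JIS bytes to UTF-8 bytes, as the question title asks. It assumes .NET Core / .NET 5+ with the code pages provider registered, and the names ShiftJisToUtf8 and ToUtf8 are hypothetical:

```csharp
using System;
using System.Text;

public static class ShiftJisToUtf8
{
    static ShiftJisToUtf8()
    {
        // Assumption: .NET Core / .NET 5+, where code page 932 needs this provider.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
    }

    // Decodes the input as Shift-JIS (code page 932) and re-encodes it as UTF-8.
    public static byte[] ToUtf8(byte[] shiftJisBytes)
    {
        return Encoding.Convert(Encoding.GetEncoding(932), Encoding.UTF8, shiftJisBytes);
    }

    public static void Main()
    {
        // Shift-JIS bytes for "チャネル", the first four characters of the example string.
        byte[] shiftJisBytes = { 0x83, 0x60, 0x83, 0x83, 0x83, 0x6C, 0x83, 0x8B };
        byte[] utf8Bytes = ToUtf8(shiftJisBytes);
        Console.WriteLine(utf8Bytes.Length); // each of these katakana takes 3 bytes in UTF-8
    }
}
```

If only a .NET string is needed (for example, to hand to a web framework that writes UTF-8 itself), Encoding.GetEncoding(932).GetString(bytes) is enough and the explicit conversion can be skipped.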

Tags:

c#

encoding

willvv
