
C#: Convert Japanese text encoded in Shift-JIS and stored as ASCII into UTF-8

Tags: c#, encoding

I am trying to convert an old application that has some strings stored in the database as ASCII.

For example, the string: ƒ`ƒƒƒlƒ‹ƒp[ƒgƒi[‚Ì‘I‘ð is stored in the database.

Now, if I copy that string into a text editor, save it as ASCII, open the file in a web browser and let the browser auto-detect the encoding, I get the correct string in Japanese: チャネルパートナーの選択. The page reports the detected encoding as Japanese (Shift_JIS).

When I try to do the conversion in C# with something like this:

var asciiBytes = Encoding.ASCII.GetBytes(text);
var japaneseEncoding = Encoding.GetEncoding(932);
var convertedBytes = Encoding.Convert(japaneseEncoding, Encoding.ASCII, asciiBytes);
var japaneseString = japaneseEncoding.GetString(convertedBytes);

I get ?`???l???p?[?g?i?[???I?? as the Japanese string, so I cannot show it on the web page.

Any pointers would be appreciated.

Thanks

willvv asked Nov 12 '13 01:11


2 Answers

"some strings stored in the database as ASCII"

It isn't ASCII; almost none of the characters in the stored string are ASCII. Encoding.ASCII.GetBytes(text) replaces every character it cannot represent with a question mark, which is why you got all those question marks.
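To see where the question marks come from, here is a minimal sketch (the class and method names are made up for illustration). Encoding.ASCII only covers U+0000 to U+007F and, with its default fallback, substitutes '?' (0x3F) for everything else, so the original byte values are gone after this call:

```csharp
using System;
using System.Text;

public static class AsciiLossDemo
{
    // Hypothetical helper: hex dump of what Encoding.ASCII produces for a string.
    public static string AsciiHex(string text)
    {
        return BitConverter.ToString(Encoding.ASCII.GetBytes(text));
    }

    public static void Main()
    {
        // "\u0192" is the character ƒ, which is outside the ASCII range.
        // The backtick and the letter l are plain ASCII and survive.
        Console.WriteLine(AsciiHex("\u0192`\u0192l")); // 3F-60-3F-6C
    }
}
```

Both ƒ characters collapse to 0x3F, so nothing downstream can tell which bytes they used to be.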

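The core issue is that the bytes in the database column were read with the wrong encoding: they were decoded as code page 1252. Encoding the string back with 1252 recovers the original bytes, which you can then decode as code page 932: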

var badstringFromDatabase = "ƒ`ƒƒƒlƒ‹ƒp[ƒgƒi[‚Ì‘I‘ð";
var hopefullyRecovered = Encoding.GetEncoding(1252).GetBytes(badstringFromDatabase);
var oughtToBeJapanese = Encoding.GetEncoding(932).GetString(hopefullyRecovered);

This produces "チャネルパートナーの選択".

This is not going to be completely reliable, because code page 1252 leaves a few byte values unassigned (0x81, 0x8D, 0x8F, 0x90 and 0x9D) that code page 932 uses. If one of those bytes was replaced when the data was read, you end up with a garbled string from which you can no longer recover the original byte values. You'll need to focus on getting the data provider to use the correct encoding.
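If you do have to repair existing rows, it may help to fail loudly instead of silently emitting '?'. The following is only a sketch under some assumptions: TryRecoverShiftJis is a hypothetical helper name, the code targets modern .NET where code pages 932 and 1252 need CodePagesEncodingProvider registered (on .NET Framework that line can be dropped), and the garbled input is built from known bytes rather than read from a database. It uses exception fallbacks so that a character with no 1252 byte, or a byte sequence that is not valid in 932, makes the helper return null:

```csharp
using System;
using System.Text;

public static class MojibakeRecovery
{
    // Hypothetical helper: returns the recovered text, or null when the
    // garbled string cannot be mapped back to bytes without loss.
    public static string TryRecoverShiftJis(string garbled)
    {
        // Needed on .NET Core / .NET 5+ before asking for code pages 932 and 1252.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        Encoding strict1252 = Encoding.GetEncoding(
            1252, EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback);
        Encoding strict932 = Encoding.GetEncoding(
            932, EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback);

        try
        {
            byte[] originalBytes = strict1252.GetBytes(garbled); // undo the wrong decode
            return strict932.GetString(originalBytes);           // decode as Shift-JIS
        }
        catch (EncoderFallbackException) { return null; } // char has no 1252 byte
        catch (DecoderFallbackException) { return null; } // bytes are not valid 932
    }

    public static void Main()
    {
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        // Simulate the damage: Shift-JIS bytes wrongly decoded as 1252.
        byte[] shiftJisBytes = Encoding.GetEncoding(932).GetBytes("チャネルパートナーの選択");
        string garbled = Encoding.GetEncoding(1252).GetString(shiftJisBytes);
        Console.WriteLine(TryRecoverShiftJis(garbled) ?? "(not recoverable)");

        // Simulate a lossy layer that swapped a character for U+FFFD.
        string damaged = "\uFFFD" + garbled.Substring(1);
        Console.WriteLine(TryRecoverShiftJis(damaged) ?? "(not recoverable)");
    }
}
```

As far as I know, .NET's own 1252 decoder maps the unassigned bytes to the matching C1 control characters, so the loss usually happens in some other layer (a driver, an export, a copy and paste). This check only tells you which rows are still recoverable; it does not replace fixing the encoding at the data provider.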

Hans Passant answered Sep 29 '22 10:09


As per the other answer, I'm pretty sure you're using the ANSI/Default encoding, not ASCII.

The following examples seem to get you what you're after.

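// Needs: using System.IO; using System.Text;
// Dump() is LINQPad's output helper; use Console.WriteLine outside LINQPad.
var japaneseEncoding = Encoding.GetEncoding(932);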

// From file bytes
var fileBytes = File.ReadAllBytes(@"C:\temp\test.html");
var japaneseTextFromFile = japaneseEncoding.GetString(fileBytes);
japaneseTextFromFile.Dump();

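// From string bytes
// Encoding.Default is the system ANSI code page on .NET Framework;
// on .NET Core and later it is UTF-8, so use an explicit code page there.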
var textString = "ƒ`ƒƒƒlƒ‹ƒp[ƒgƒi[‚Ì‘I‘ð";
var textBytes = Encoding.Default.GetBytes(textString);
var japaneseTextFromString = japaneseEncoding.GetString(textBytes);
japaneseTextFromString.Dump();

Interestingly, I think I need to read up on Encoding.Convert, as it did not produce the behaviour I expected. The GetString methods only seem to work if I pass in bytes obtained with Encoding.Default; if I convert them to the Japanese encoding beforehand, they do not work as expected.
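For what it's worth, my understanding of Encoding.Convert(srcEncoding, dstEncoding, bytes) is that it decodes the bytes as srcEncoding and re-encodes the resulting text as dstEncoding. So it is the right tool once you hold genuine Shift-JIS bytes and want UTF-8, but it does nothing useful on bytes that came out of Encoding.ASCII.GetBytes, and converting "to Japanese" bytes that already are Shift-JIS just encodes the garbled characters a second time. A small sketch, with a made-up helper name and assuming modern .NET with the code pages provider registered:

```csharp
using System;
using System.Text;

public static class ConvertDemo
{
    // Hypothetical helper: transcode genuine Shift-JIS bytes to UTF-8 bytes.
    public static byte[] ShiftJisToUtf8(byte[] shiftJisBytes)
    {
        // Needed on .NET Core / .NET 5+ before asking for code page 932.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding shiftJis = Encoding.GetEncoding(932);

        // Decodes as 932, then encodes the same text as UTF-8.
        return Encoding.Convert(shiftJis, Encoding.UTF8, shiftJisBytes);
    }

    public static void Main()
    {
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        byte[] shiftJisBytes = Encoding.GetEncoding(932).GetBytes("チャネルパートナーの選択");
        byte[] utf8Bytes = ShiftJisToUtf8(shiftJisBytes);

        // 12 characters: 2 bytes each in Shift-JIS, 3 bytes each in UTF-8.
        Console.WriteLine(shiftJisBytes.Length + " -> " + utf8Bytes.Length); // 24 -> 36
        Console.WriteLine(Encoding.UTF8.GetString(utf8Bytes));
    }
}
```

Inside .NET the string itself is always UTF-16, so in practice you only need the UTF-8 bytes at the point where the text is written out, for example to the HTTP response.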

Matt Mitchell answered Sep 29 '22 09:09