UTF32 and C# problems

I'm having some trouble with character encoding. When I put the following two characters into a UTF-32 encoded text file:

𩸕
鸕

and then run this code on them:

System.IO.StreamReader streamReader = 
    new System.IO.StreamReader("input", System.Text.Encoding.UTF32, false);
System.IO.StreamWriter streamWriter = 
    new System.IO.StreamWriter("output", false, System.Text.Encoding.UTF32);
    
streamWriter.Write(streamReader.ReadToEnd());

streamWriter.Close();
streamReader.Close();

I get:

鸕
鸕

(the same character twice, i.e. the output file != the input file)

A few things that might help: Hex for the first character:

15 9E 02 00

And for the second:

15 9E 00 00

I am using gedit to create the text file and Mono to run the C#, all on Ubuntu.

It also doesn't matter whether I specify the encoding for the input or output file; it just fails whenever the input is UTF-32 encoded. It works if the input file is UTF-8 encoded.

The input file is as follows:

FF FE 00 00 15 9E 02 00 0A 00 00 00 15 9E 00 00 0A 00 00 00
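For reference, the dump above can be decoded by hand. This is only a minimal sketch, assuming the file is little-endian UTF-32 with a 4-byte BOM (which the leading FF FE 00 00 suggests); the names Utf32Dump and DecodeCodePoints are made up for illustration and are not part of any library.

```csharp
using System;
using System.Collections.Generic;

static class Utf32Dump
{
    // Decodes little-endian UTF-32 bytes into code points,
    // skipping a leading BOM (FF FE 00 00) if one is present.
    public static List<int> DecodeCodePoints(byte[] bytes)
    {
        var codePoints = new List<int>();
        int start = 0;
        if (bytes.Length >= 4 && bytes[0] == 0xFF && bytes[1] == 0xFE
            && bytes[2] == 0x00 && bytes[3] == 0x00)
        {
            start = 4;
        }
        for (int i = start; i + 3 < bytes.Length; i += 4)
        {
            // Assemble the value explicitly so it does not depend on machine endianness
            int cp = bytes[i] | (bytes[i + 1] << 8) | (bytes[i + 2] << 16) | (bytes[i + 3] << 24);
            codePoints.Add(cp);
        }
        return codePoints;
    }

    static void Main()
    {
        // The bytes of the input file from the question
        byte[] input = {
            0xFF, 0xFE, 0x00, 0x00,
            0x15, 0x9E, 0x02, 0x00,
            0x0A, 0x00, 0x00, 0x00,
            0x15, 0x9E, 0x00, 0x00,
            0x0A, 0x00, 0x00, 0x00
        };
        foreach (int cp in DecodeCodePoints(input))
        {
            // char.ConvertFromUtf32 returns a UTF-16 string; Length counts its char units
            Console.WriteLine("U+{0:X4} -> {1} UTF-16 char(s)", cp, char.ConvertFromUtf32(cp).Length);
        }
    }
}
```

The code points come out as U+29E15, U+000A, U+9E15, U+000A, and only U+29E15 needs two UTF-16 chars. Note that 0x29E15 with everything above the low 16 bits dropped is 0x9E15, which would be consistent with the output shown above, though that is a guess about the cause rather than something confirmed here.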

Is it a bug, or is it just me?

Thanks!

AStupidNoob asked Apr 03 '12 05:04



2 Answers

OK, I think I figured it out; it seems to work now. The first character's code is 15 9E 02 00 (U+29E15), which is too large to be held in a single UTF-16 char (the second, 15 9E 00 00, is U+9E15 and does fit). C# strings are UTF-16, so instead UTF-16 uses surrogate pairs, where two chars together act as one 'element'. To get the elements, we can use:

// fred is the string to walk; needs "using System.Globalization;"
TextElementEnumerator elements = StringInfo.GetTextElementEnumerator(fred);

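This returns a TextElementEnumerator. Each call to its GetTextElement() method returns a string holding one whole text element, including both halves of a surrogate pair. Treat that string as one character.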
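As a rough illustration of the idea, here is a small sketch that walks a string one text element at a time. The TextElements class and its Split helper are hypothetical names used only for this example; the string is built from the two code points in the question rather than read from a file.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

static class TextElements
{
    // Splits a string into text elements, so a surrogate pair stays together as one string.
    public static List<string> Split(string text)
    {
        var elements = new List<string>();
        TextElementEnumerator enumerator = StringInfo.GetTextElementEnumerator(text);
        while (enumerator.MoveNext())
        {
            elements.Add(enumerator.GetTextElement());
        }
        return elements;
    }

    static void Main()
    {
        // U+29E15 needs a surrogate pair in UTF-16; U+9E15 fits in a single char
        string fred = char.ConvertFromUtf32(0x29E15) + char.ConvertFromUtf32(0x9E15);

        // Length counts UTF-16 chars, not characters as a reader sees them
        Console.WriteLine("chars: {0}", fred.Length);

        foreach (string element in Split(fred))
        {
            // ConvertToUtf32 combines a surrogate pair back into a single code point
            Console.WriteLine("U+{0:X4}", char.ConvertToUtf32(element, 0));
        }
    }
}
```

Here fred.Length is 3, while Split returns 2 elements, which is why indexing the string char by char can cut a character in half.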

See here:

http://msdn.microsoft.com/en-us/library/system.globalization.stringinfo.aspx

http://msdn.microsoft.com/en-us/library/system.globalization.textelementenumerator.gettextelement.aspx

Hope it helps someone :D

AStupidNoob answered Sep 19 '22 01:09



I tried this and it works well on my PC.

System.IO.StreamReader streamReader = new System.IO.StreamReader("input", true);
System.IO.StreamWriter streamWriter = new System.IO.StreamWriter("output", false);

streamWriter.Write(streamReader.ReadToEnd());

streamWriter.Close();
streamReader.Close();

Maybe the text you think is UTF-32 isn't actually UTF-32.
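One way to check that is to look at the first bytes of the file. The following is a sketch only, assuming the file would carry the little-endian UTF-32 BOM (FF FE 00 00) if it really were UTF-32 LE; BomCheck and StartsWithUtf32LeBom are illustrative names, and "input" is the file name from the snippets above.

```csharp
using System;
using System.IO;

static class BomCheck
{
    // Returns true if the data starts with the UTF-32 little-endian BOM (FF FE 00 00).
    public static bool StartsWithUtf32LeBom(byte[] data)
    {
        return data.Length >= 4
            && data[0] == 0xFF && data[1] == 0xFE
            && data[2] == 0x00 && data[3] == 0x00;
    }

    static void Main()
    {
        // Read the raw bytes so no decoder gets a chance to reinterpret them
        byte[] data = File.ReadAllBytes("input");
        Console.WriteLine(StartsWithUtf32LeBom(data)
            ? "Starts with the UTF-32 LE BOM (FF FE 00 00)"
            : "No UTF-32 LE BOM found");
    }
}
```

A missing BOM does not prove the file is not UTF-32, since the BOM is optional, and a UTF-16 LE file that begins with FF FE followed by a NUL character would also match, so treat the result as a hint rather than proof.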

Chibueze Opata answered Sep 23 '22 01:09
