
Java charset encoding problem (from UTF-8 to cp866)

How do I convert text from utf8/cp1251 (Windows Cyrillic) to DOS Cyrillic (cp866)?

I found this example:

Charset fromCharset = Charset.forName("utf8");
Charset toCharset = Charset.forName("cp866");

String text1 = "Николай"; // my name in Bulgarian
String text2 = "Nikolay"; // my name in English

System.out.println("TEXT1 :[" + toCharset.decode(fromCharset.encode(text1)).toString() + "]");
System.out.println("TEXT2 :[" + toCharset.decode(fromCharset.encode(text2)).toString() + "]");

And the output is:

TEXT1 :[╨Э╨╕╨║╨╛╨╗╨░╨╣] // WRONG
TEXT2 :[Nikolay]  // CORRECT

Where is the problem?

NikolayGS asked Jan 24 '11 13:01



3 Answers

First off: if you've got a String object, then it no longer has an encoding; it's a pure Unicode string (*)!

In Java, encodings are used only when you convert from bytes (byte[]) to a string (String) or vice versa. (You could theoretically do a direct conversion from byte[] to byte[] but I've yet to see that done in Java).

If you have some cp1251 encoded data, then it must be either a byte[] (i.e. an array of bytes) or in some kind of stream (e.g. provided to you as an InputStream).

If you want to provide some data as cp866, then you must provide it either as a byte[] or as some kind of stream (e.g. an OutputStream).
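As a rough sketch of that byte[]-in, byte[]-out flow: the class name Cp1251ToCp866 and its convert helper are made up for illustration, and the code assumes the running JDK ships the windows-1251 and IBM866 (cp866) charsets, which standard builds normally do.

```java
import java.nio.charset.Charset;

class Cp1251ToCp866 {
    // Assumption: both charsets are available in this JDK.
    // Charset.forName throws UnsupportedCharsetException if they are not.
    static final Charset CP1251 = Charset.forName("windows-1251");
    static final Charset CP866 = Charset.forName("IBM866");

    // Hypothetical helper: decode the cp1251 bytes into a String,
    // then encode that String as cp866 bytes.
    static byte[] convert(byte[] cp1251Bytes) {
        String text = new String(cp1251Bytes, CP1251);
        return text.getBytes(CP866);
    }

    public static void main(String[] args) {
        // The Cyrillic letters "Ни" as they would arrive in cp1251.
        byte[] input = {(byte) 0xCD, (byte) 0xE8};
        byte[] output = convert(input);
        for (byte b : output) {
            System.out.printf("%02X ", b & 0xFF); // same letters, now as cp866 bytes
        }
        System.out.println();
    }
}
```

The String in the middle is the only place where the text exists without a byte encoding attached; everything on either side of it is just bytes.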

Also: there's no such thing as "utf8/cp1251". UTF-8 and CP-1251 are pretty much unrelated character encodings. Your input is either UTF-8 or CP-1251 (or something else). It can't really be both (+).

And here's the obligatory link: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

(*) Yes, strictly speaking it has an encoding and it is UTF-16, but for most purposes you can (and should) think of it as an "encodingless ideal Unicode String".
(+) Strictly speaking it could be both if it only uses characters that encode to the same bytes in both encodings, which is usually the ASCII subset.

Joachim Sauer answered Nov 01 '22 22:11


The problem is that you're trying to decode the output of one encoding as if it's a different one.

Imagine that you had a program which could only write out JPEGs, and another which could only read PNGs... would you expect to be able to read the output of the first program with the second?

In this case the two encodings happen to be compatible for ASCII characters, but fundamentally you're doing the wrong thing.
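The mismatch can be reproduced in a few lines. This is only an illustrative sketch: MojibakeDemo and misdecode are invented names, the Cyrillic text is written with Unicode escapes so the source file's own encoding does not matter, and the IBM866 charset is assumed to be available in the JDK.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    // Hypothetical helper that repeats the mistake on purpose:
    // encode as UTF-8, then decode those bytes as if they were cp866.
    static String misdecode(String text) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, Charset.forName("IBM866"));
    }

    public static void main(String[] args) {
        String cyrillic = "\u041D\u0438\u043A\u043E\u043B\u0430\u0439"; // "Николай"
        String latin = "Nikolay";

        // Each Cyrillic letter takes two bytes in UTF-8, and cp866 turns
        // every single byte into one character, so the text doubles in length.
        System.out.println(cyrillic.length() + " chars became " + misdecode(cyrillic).length());

        // ASCII letters are one byte with the same value in both encodings,
        // so the Latin text survives unchanged.
        System.out.println(latin.equals(misdecode(latin)));
    }
}
```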

If you have text which is already in UTF-8, you should read that from binary data into a Unicode string using the UTF-8 encoding, and then write it out to binary data again using your other encoding. Unicode is basically the intermediate step, as Java's native text format. This would be the equivalent of loading the JPEG output into another program which could perform the conversion to PNG before you read it with the second app.
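For data that arrives as a stream, the same read-then-write idea might look like the sketch below. StreamTranscoder and both transcode methods are hypothetical names chosen for this example; the charset names passed in main are assumptions about what the JDK supports.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;

class StreamTranscoder {
    // Decode bytes from 'in' using 'from', re-encode them to 'out' using 'to'.
    static void transcode(InputStream in, Charset from, OutputStream out, Charset to)
            throws IOException {
        Reader reader = new InputStreamReader(in, from);
        Writer writer = new OutputStreamWriter(out, to);
        char[] buffer = new char[4096];
        int read;
        while ((read = reader.read(buffer)) != -1) {
            writer.write(buffer, 0, read);
        }
        writer.flush(); // flush only; closing the streams is left to the caller
    }

    // Convenience wrapper for in-memory data, taking charset names.
    static byte[] transcode(byte[] data, String fromName, String toName) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            transcode(new ByteArrayInputStream(data), Charset.forName(fromName),
                      out, Charset.forName(toName));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] utf8 = {(byte) 0xD0, (byte) 0x9D}; // the letter "Н" encoded as UTF-8
        byte[] cp866 = transcode(utf8, "UTF-8", "IBM866");
        System.out.println(utf8.length + " UTF-8 bytes -> " + cp866.length + " cp866 byte(s)");
    }
}
```

Going through a Reader and a Writer keeps the decoding and encoding steps explicit, and it avoids loading the whole input into memory when the source is a file or a socket.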

Jon Skeet answered Nov 01 '22 21:11


A short solution to your problem:

 System.out.write("ВАСЯ\n".getBytes("cp866")); // right: writes the raw cp866 bytes to the console
 System.out.println("ВАСЯ".getBytes("cp866")); // wrong: prints the byte[] object's toString(), not the text

Result from cmd.exe:

C:\Documents and Settings\afram\Мои документы\NetBeansProjects\Encoding\dist>java -jar Encoding.jar

ВАСЯ

[B@1bab50a
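An alternative to writing raw bytes by hand is to wrap the target stream in a PrintStream that encodes as cp866, so println can be used as usual. The following is a minimal sketch under the assumption that the console really expects cp866 and that the JDK provides the IBM866 charset; Cp866Console and cp866Stream are names invented for this example.

```java
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

class Cp866Console {
    // Hypothetical helper: a PrintStream that encodes everything it prints as cp866.
    static PrintStream cp866Stream(OutputStream target) {
        try {
            return new PrintStream(target, true, "IBM866");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("IBM866 (cp866) is not available in this JDK", e);
        }
    }

    public static void main(String[] args) {
        PrintStream out = cp866Stream(System.out);
        // "ВАСЯ", written with Unicode escapes so the source file encoding does not matter.
        out.println("\u0412\u0410\u0421\u042F");
    }
}
```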

basil answered Nov 01 '22 21:11