How to convert text from utf8/cp1251(windows cyrillic) to DOS Cyrillic (cp866)
I found this example:
Charset fromCharset = Charset.forName("utf8");
Charset toCharset = Charset.forName("cp866");
String text1 = "Николай"; // my name in Bulgarian
String text2 = "Nikolay"; // my name in English
System.out.println("TEXT1 :[" + toCharset.decode(fromCharset.encode(text1)).toString() + "]");
System.out.println("TEXT2 :[" + toCharset.decode(fromCharset.encode(text2)).toString() + "]");
And the output is:
TEXT1 :[╨Э╨╕╨║╨╛╨╗╨░╨╣] // WRONG
TEXT2 :[Nikolay] // CORRECT
Where is the problem?
UTF-8 is an encoding. Unicode is a list of characters with unique numbers (code points): A = 65, B = 66, C = 67, .... An encoding translates those numbers into bytes.
UTF-8 is a variable-width encoding that uses one to four bytes per code point. The first 128 characters in Unicode, when encoded with UTF-8, have exactly the same byte representation as in ASCII.
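This ASCII compatibility is exactly why `text2` survived the broken round-trip while `text1` didn't. A minimal sketch comparing the byte sequences (charset names as resolved by the JDK):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class AsciiSubset {
    public static void main(String[] args) {
        Charset cp866 = Charset.forName("cp866");

        // ASCII-only text encodes to identical bytes in UTF-8 and cp866
        byte[] utf8 = "Nikolay".getBytes(StandardCharsets.UTF_8);
        byte[] dos  = "Nikolay".getBytes(cp866);
        System.out.println(Arrays.equals(utf8, dos)); // true

        // Cyrillic text does not: UTF-8 needs two bytes per letter, cp866 one
        byte[] utf8Cyr = "Николай".getBytes(StandardCharsets.UTF_8);
        byte[] dosCyr  = "Николай".getBytes(cp866);
        System.out.println(Arrays.equals(utf8Cyr, dosCyr)); // false
        System.out.println(utf8Cyr.length + " vs " + dosCyr.length); // 14 vs 7
    }
}
```

So decoding UTF-8 bytes as cp866 can only ever "work" for the ASCII subset.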
First off: if you've got a `String` object, then it no longer has an encoding — it's a pure Unicode string(*)!
In Java, encodings are used only when you convert from bytes (`byte[]`) to a string (`String`) or vice versa. (You could theoretically do a direct `byte[]`-to-`byte[]` conversion, but I've yet to see that done in Java.)
If you have some cp1251-encoded data, then it must be either a `byte[]` (i.e. an array of bytes) or in some kind of stream (e.g. provided to you as an `InputStream`).
If you want to provide some data as cp866, then you must provide it either as a `byte[]` or as some kind of stream (e.g. an `OutputStream`).
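Putting that together, a cp1251-to-cp866 conversion always goes `byte[]` → `String` → `byte[]`. A minimal sketch (the input bytes are produced inline here for illustration; in practice they would come from a file or stream):

```java
import java.nio.charset.Charset;

public class Cp1251ToCp866 {
    public static void main(String[] args) throws Exception {
        Charset cp1251 = Charset.forName("windows-1251");
        Charset cp866  = Charset.forName("cp866");

        // pretend this arrived as cp1251-encoded input data
        byte[] cp1251Bytes = "Николай".getBytes(cp1251);

        // step 1: decode the cp1251 bytes into an encoding-free String
        String text = new String(cp1251Bytes, cp1251);

        // step 2: encode that String into cp866 bytes
        byte[] cp866Bytes = text.getBytes(cp866);

        // write the raw bytes (not the array's toString())
        System.out.write(cp866Bytes);
        System.out.flush();
    }
}
```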
Also: there's no such thing as "utf8/cp1251". UTF-8 and CP-1251 are pretty much unrelated character encodings. Your input is either UTF-8 or CP-1251 (or something else). It can't really be both (+).
And here's the obligatory link: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
(*) yes, strictly speaking it has an encoding and it is UTF-16, but for most purposes you can (and should) think of it as an "encodingless ideal Unicode String"
(+) strictly speaking it could be both if it's only using characters that encode to the same bytes in both encodings, which is usually the ASCII subset
The problem is that you're trying to decode the output of one encoding as if it's a different one.
Imagine that you had a program which could only write out JPEGs, and another which could only read PNGs... would you expect to be able to read the output of the first program with the second?
In this case the two encodings happen to be compatible for ASCII characters, but fundamentally you're doing the wrong thing.
If you have text which is already in UTF-8, you should read that from binary data into a Unicode string using the UTF-8 encoding, and then write it out using your other encoding to binary data again. Unicode is the intermediate step basically, as Java's native text format. This would be the equivalent to loading the JPEG output into another program which could perform the conversion to PNG before you read it with the second app.
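That decode-then-re-encode path can be sketched in a few lines — the file names below are made up for illustration:

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class Utf8ToCp866File {
    public static void main(String[] args) throws IOException {
        Path in  = Paths.get("input-utf8.txt");   // hypothetical UTF-8 input
        Path out = Paths.get("output-cp866.txt"); // hypothetical cp866 output

        // decode: UTF-8 bytes -> Unicode String (the intermediate step)
        String text = new String(Files.readAllBytes(in), StandardCharsets.UTF_8);

        // encode: Unicode String -> cp866 bytes
        Files.write(out, text.getBytes(Charset.forName("cp866")));
    }
}
```

The `String` is the "JPEG decoded into the converter" of the analogy: once decoded, you can re-encode it to any target charset.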
A short solution to your problem:
System.out.write("ВАСЯ\n".getBytes("cp866")); // correct: writes the raw cp866 bytes
System.out.println("ВАСЯ".getBytes("cp866")); // wrong: prints the array reference, not the text
Result from cmd.exe:
C:\Documents and Settings\afram\Мои документы\NetBeansProjects\Encoding\dist>java -jar Encoding.jar
ВАСЯ
[B@1bab50a
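The `[B@1bab50a` appears because `println(byte[])` falls back to `println(Object)` and prints the array's default `toString()`. If you want `println` itself to emit cp866 bytes, you can wrap `System.out` in a `PrintStream` configured with that charset — a sketch, assuming the console actually uses cp866:

```java
import java.io.PrintStream;

public class Cp866Console {
    public static void main(String[] args) throws Exception {
        // a PrintStream that encodes everything it prints as cp866
        PrintStream out = new PrintStream(System.out, true, "cp866");
        out.println("ВАСЯ"); // bytes sent to the console are cp866-encoded
    }
}
```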