How to convert text from utf8/cp1251(windows cyrillic) to DOS Cyrillic (cp866)
I found this example:
Charset fromCharset = Charset.forName("utf8");
Charset toCharset = Charset.forName("cp866");
String text1 = "Николай"; // my name in Bulgarian
String text2 = "Nikolay"; // my name in English
System.out.println("TEXT1 :[" + toCharset.decode(fromCharset.encode(text1)).toString() + "]");
System.out.println("TEXT2 :[" + toCharset.decode(fromCharset.encode(text2)).toString() + "]");
And the output is:
TEXT1 :[╨Э╨╕╨║╨╛╨╗╨░╨╣] // WRONG
TEXT2 :[Nikolay] // CORRECT
Where is the problem?
UTF-8 is an encoding. Unicode is a list of characters with unique numbers (code points): A = 65, B = 66, C = 67, .... An encoding translates those numbers into bytes.
UTF-8 is a variable-width encoding that uses one to four bytes per code point. The first 128 characters in Unicode, when encoded with UTF-8, have exactly the same byte representation as in ASCII.
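This ASCII compatibility is exactly why `text2` survived the broken round-trip while `text1` didn't. A minimal sketch comparing the byte sequences (charset names as resolved by the JDK):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class AsciiSubset {
    public static void main(String[] args) {
        Charset cp866 = Charset.forName("cp866");

        // ASCII-only text encodes to identical bytes in UTF-8 and cp866
        byte[] utf8 = "Nikolay".getBytes(StandardCharsets.UTF_8);
        byte[] dos  = "Nikolay".getBytes(cp866);
        System.out.println(Arrays.equals(utf8, dos)); // true

        // Cyrillic text does not: UTF-8 needs two bytes per letter, cp866 one
        byte[] utf8Cyr = "Николай".getBytes(StandardCharsets.UTF_8);
        byte[] dosCyr  = "Николай".getBytes(cp866);
        System.out.println(Arrays.equals(utf8Cyr, dosCyr)); // false
        System.out.println(utf8Cyr.length + " vs " + dosCyr.length); // 14 vs 7
    }
}
```

So decoding UTF-8 bytes as cp866 can only ever "work" for the ASCII subset.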
First off: if you've got a `String` object, then it no longer has an encoding — it's a pure Unicode string(*)!
In Java, encodings are used only when you convert from bytes (`byte[]`) to a string (`String`) or vice versa. (You could theoretically do a direct `byte[]`-to-`byte[]` conversion, but I've yet to see that done in Java.)
If you have some cp1251-encoded data, then it must be either a `byte[]` (i.e. an array of bytes) or in some kind of stream (e.g. provided to you as an `InputStream`).
If you want to provide some data as cp866, then you must provide it either as a `byte[]` or as some kind of stream (e.g. an `OutputStream`).
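Putting that together, a cp1251-to-cp866 conversion always goes `byte[]` → `String` → `byte[]`. A minimal sketch (the input bytes are produced inline here for illustration; in practice they would come from a file or stream):

```java
import java.nio.charset.Charset;

public class Cp1251ToCp866 {
    public static void main(String[] args) throws Exception {
        Charset cp1251 = Charset.forName("windows-1251");
        Charset cp866  = Charset.forName("cp866");

        // pretend this arrived as cp1251-encoded input data
        byte[] cp1251Bytes = "Николай".getBytes(cp1251);

        // step 1: decode the cp1251 bytes into an encoding-free String
        String text = new String(cp1251Bytes, cp1251);

        // step 2: encode that String into cp866 bytes
        byte[] cp866Bytes = text.getBytes(cp866);

        // write the raw bytes (not the array's toString())
        System.out.write(cp866Bytes);
        System.out.flush();
    }
}
```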
Also: there's no such thing as "utf8/cp1251". UTF-8 and CP-1251 are pretty much unrelated character encodings. Your input is either UTF-8 or CP-1251 (or something else). It can't really be both (+).
And here's the obligatory link: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
(*) yes, strictly speaking it has an encoding and it is UTF-16, but for most purposes you can (and should) think of it as an "encodingless ideal Unicode String"
(+) strictly speaking it could be both if it's only using characters that encode to the same bytes in both encodings, which is usually the ASCII subset
The problem is that you're trying to decode the output of one encoding as if it's a different one.
Imagine that you had a program which could only write out JPEGs, and another which could only read PNGs... would you expect to be able to read the output of the first program with the second?
In this case the two encodings happen to be compatible for ASCII characters, but fundamentally you're doing the wrong thing.
If you have text which is already in UTF-8, you should read that from binary data into a Unicode string using the UTF-8 encoding, and then write it out using your other encoding to binary data again. Unicode is the intermediate step basically, as Java's native text format. This would be the equivalent to loading the JPEG output into another program which could perform the conversion to PNG before you read it with the second app.
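That decode-then-re-encode path can be sketched in a few lines — the file names below are made up for illustration:

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class Utf8ToCp866File {
    public static void main(String[] args) throws IOException {
        Path in  = Paths.get("input-utf8.txt");   // hypothetical UTF-8 input
        Path out = Paths.get("output-cp866.txt"); // hypothetical cp866 output

        // decode: UTF-8 bytes -> Unicode String (the intermediate step)
        String text = new String(Files.readAllBytes(in), StandardCharsets.UTF_8);

        // encode: Unicode String -> cp866 bytes
        Files.write(out, text.getBytes(Charset.forName("cp866")));
    }
}
```

The `String` is the "JPEG decoded into the converter" of the analogy: once decoded, you can re-encode it to any target charset.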
A short solution to your problem:
System.out.write("ВАСЯ\n".getBytes("cp866")); // correct: writes the raw cp866 bytes
System.out.println("ВАСЯ".getBytes("cp866")); // wrong: prints the array reference, not the text
Result from cmd.exe:
C:\Documents and Settings\afram\Мои документы\NetBeansProjects\Encoding\dist>java -jar Encoding.jar
ВАСЯ
[B@1bab50a
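The `[B@1bab50a` appears because `println(byte[])` falls back to `println(Object)` and prints the array's default `toString()`. If you want `println` itself to emit cp866 bytes, you can wrap `System.out` in a `PrintStream` configured with that charset — a sketch, assuming the console actually uses cp866:

```java
import java.io.PrintStream;

public class Cp866Console {
    public static void main(String[] args) throws Exception {
        // a PrintStream that encodes everything it prints as cp866
        PrintStream out = new PrintStream(System.out, true, "cp866");
        out.println("ВАСЯ"); // bytes sent to the console are cp866-encoded
    }
}
```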