
Why does new String(bytes, enc).getBytes(enc) not return the original byte array?


Tags:

java

I made the following "simulation":

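byte[] b = new byte[256];
// Fill the array with every possible byte value, -128..127
for (int i = 0; i < 256; i++) {
    b[i] = (byte) (i - 128);
}
// Decode to a String and encode back with the same charset
byte[] transformed = new String(b, "cp1251").getBytes("cp1251");

// Report every index whose byte did not survive the round trip
for (int i = 0; i < b.length; i++) {
    if (b[i] != transformed[i]) {
        System.out.println("Wrong : " + i);
    }
}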

For cp1251 this outputs only one wrong byte - at position 25.
For KOI8-R - all fine.
For cp1252 - 4 or 5 differences.

What is the reason for this and how can this be overcome?

I know it is wrong to represent byte arrays as strings in whatever encoding, but it is a requirement of the protocol of a payment provider, so I don't have a choice.

Update: representing it in ISO-8859-1 works, and I'll use it for the byte[] part, and cp1251 for the textual part, so the question remains only out of curiosity.
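For reference, a minimal sketch of the ISO-8859-1 workaround mentioned in the update. The class and method names (Latin1RoundTrip, wrap, unwrap) are made up for illustration and are not part of any protocol or library. It relies on Java's ISO-8859-1 charset mapping each byte to exactly one char in the range U+0000..U+00FF and back.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Latin1RoundTrip {
    /** Wraps arbitrary bytes in a String, one char per byte (U+0000..U+00FF). */
    static String wrap(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    /** Recovers the original bytes from a String produced by wrap(). */
    static byte[] unwrap(String wrapped) {
        return wrapped.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Same test data as in the question: every byte value once.
        byte[] all = new byte[256];
        for (int i = 0; i < all.length; i++) {
            all[i] = (byte) (i - 128);
        }
        // Compare the round-tripped bytes with the input.
        System.out.println(Arrays.equals(all, unwrap(wrap(all))));
    }
}
```

Note that a String produced this way is only a container for the bytes. It should not be mixed with text that is later encoded in cp1251, because chars above U+007F generally mean something different there.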


asked Mar 30 '10 12:03

Bozho

4 Answers

Some of the "bytes" are not supported in the target set - they are replaced with the ? character. When you convert back, ? is normally converted to the byte value 63 - which isn't what it was before.


answered Nov 15 '22 18:11

lexicore

What is the reason for this

The reason is that character encodings are not necessarily bijective, and there is no good reason to expect them to be. Not all bytes or byte sequences are legal in all encodings, and illegal sequences are usually decoded to some sort of placeholder character like '?' or U+FFFD, which of course does not produce the same bytes when re-encoded.

Additionally, some encodings may map some legal different byte sequences to the same string.
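To see which byte values are affected for a given charset, something like the following sketch can be used. The class and method names are hypothetical. It checks each byte value on its own, which is equivalent to the whole-array test in the question only for single-byte charsets such as the ones listed, and it assumes the JVM provides those charsets.

```java
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

class LossyByteScan {
    /** Returns the unsigned values (0..255) of single bytes that do not survive decode + encode. */
    static List<Integer> lossyBytes(Charset cs) {
        List<Integer> lossy = new ArrayList<>();
        for (int v = 0; v < 256; v++) {
            byte[] in = { (byte) v };
            byte[] out = new String(in, cs).getBytes(cs);
            // A byte is lossy if it comes back as a different byte or as several bytes.
            if (out.length != 1 || out[0] != in[0]) {
                lossy.add(v);
            }
        }
        return lossy;
    }

    public static void main(String[] args) {
        for (String name : new String[] { "windows-1251", "windows-1252", "KOI8-R", "ISO-8859-1" }) {
            System.out.println(name + " -> " + lossyBytes(Charset.forName(name)));
        }
    }
}
```

An empty list means every single byte round-trips in that charset, which matches the observation in the question that KOI8-R and ISO-8859-1 show no differences.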

answered Nov 15 '22 18:11

Michael Borgwardt

It appears that both cp1251 and cp1252 have byte values that do not correspond to defined characters; i.e. they are "unmappable".

The javadoc for String(byte[], String) says this:

The behavior of this constructor when the given bytes are not valid in the given charset is unspecified. The CharsetDecoder class should be used when more control over the decoding process is required.

Other constructors say this:

This method always replaces malformed-input and unmappable-character sequences with this charset's default replacement string.

If you see this kind of thing happening in practice it indicates that either you are using the wrong character set, or you've been given some bad data. Either way, it is probably not a good idea to carry on as if there was no problem.

I've been trying to figure out if there is a way to get a CharsetDecoder to "preserve" unmappable characters, and I don't think it is possible unless you are willing to implement a custom decoder/encoder pair. But I've also concluded that it does not make sense to even try. It is (theoretically) wrong to map those unmappable characters to real Unicode code points. And if you do, how is your application going to handle them?
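While a CharsetDecoder cannot preserve such bytes, it can at least refuse to replace them silently. A minimal sketch, assuming the goal is to fail fast on bad input; StrictDecode and decodeStrict are illustrative names, not an existing API. It uses CodingErrorAction.REPORT so that decoding throws instead of substituting a replacement character.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

class StrictDecode {
    /** Decodes bytes, throwing instead of substituting a replacement character. */
    static String decodeStrict(byte[] bytes, Charset cs) throws CharacterCodingException {
        CharsetDecoder decoder = cs.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        return decoder.decode(ByteBuffer.wrap(bytes)).toString();
    }

    public static void main(String[] args) {
        Charset cp1251 = Charset.forName("windows-1251");
        // 0x98 has no character assigned in windows-1251.
        byte[] suspect = { 'o', 'k', (byte) 0x98 };
        try {
            System.out.println(decodeStrict(suspect, cp1251));
        } catch (CharacterCodingException e) {
            System.out.println("Cannot decode losslessly: " + e);
        }
    }
}
```

Whether to reject the input or fall back to another representation is an application decision; the sketch only makes the problem visible.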

answered Nov 15 '22 18:11

Stephen C

Actually there should be exactly one difference: the byte at index 24 (value 0x98, since the array stores i - 128) is converted to a char of value 0xFFFD; that's the "Unicode replacement character", used for untranslatable bytes. When converted back, you get a question mark (value 63).

In CP1251, the byte 0x98 is not assigned to any character, which is why Java deems it "untranslatable".
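The replacement can be traced for a single byte with a sketch like this one. The names are illustrative, and it assumes the byte in question is the one stored at index 24 of the question's array, i.e. (byte) (24 - 128) = 0x98. On common JDKs this is expected to report U+FFFD after decoding and 63 after re-encoding, although the javadoc leaves the constructor's behavior for invalid bytes unspecified.

```java
import java.nio.charset.Charset;

class ReplacementTrace {
    /** Decodes one byte and returns the code point of the resulting char. */
    static int decodedCodePoint(byte b, Charset cs) {
        return new String(new byte[] { b }, cs).codePointAt(0);
    }

    /** Decodes one byte, re-encodes it, and returns the first resulting byte as an unsigned value. */
    static int roundTrip(byte b, Charset cs) {
        return new String(new byte[] { b }, cs).getBytes(cs)[0] & 0xFF;
    }

    public static void main(String[] args) {
        Charset cp1251 = Charset.forName("windows-1251");
        byte b = (byte) (24 - 128); // value stored at index 24 in the question's array
        System.out.printf("in=0x%02X decoded=U+%04X out=%d%n",
                b & 0xFF, decodedCodePoint(b, cp1251), roundTrip(b, cp1251));
    }
}
```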

answered Nov 15 '22 17:11

Thomas Pornin
