Going backwards from UTF-8 to ISO-8859-1 will cause "replacement characters" (�) to appear in your text when unsupported characters are found. byte[] utf8 = ... byte[] latin1 = new String(utf8, "UTF-8"). getBytes("ISO-8859-1"); You can exercise more control by using the lower-level Charset APIs.
UTF-8 is a multibyte encoding that can represent any Unicode character. ISO 8859-1 is a single-byte encoding that can represent the first 256 Unicode characters. Both encode ASCII exactly the same way.
UTF-8 uses one byte to represent code points from 0-127. These first 128 Unicode code points correspond one-to-one with ASCII character mappings, so ASCII characters are also valid UTF-8 characters.
Click Tools, then select Web options. Go to the Encoding tab. In the dropdown for Save this document as: choose Unicode (UTF-8). Click Ok.
If you're dealing with character encodings other than UTF-16, you shouldn't be using java.lang.String
or the char
primitive -- you should only be using byte[]
arrays or ByteBuffer
objects. Then, you can use java.nio.charset.Charset
to convert between encodings:
Charset utf8charset = Charset.forName("UTF-8");
Charset iso88591charset = Charset.forName("ISO-8859-1");
ByteBuffer inputBuffer = ByteBuffer.wrap(new byte[]{(byte)0xC3, (byte)0xA2});
// decode UTF-8
CharBuffer data = utf8charset.decode(inputBuffer);
// encode ISO-8559-1
ByteBuffer outputBuffer = iso88591charset.encode(data);
byte[] outputData = outputBuffer.array();
byte[] iso88591Data = theString.getBytes("ISO-8859-1");
Will do the trick. From your description it seems as if you're trying to "store an ISO-8859-1 String". String objects in Java are always implicitly encoded in UTF-16. There's no way to change that encoding.
What you can do, 'though is to get the bytes that constitute some other encoding of it (using the .getBytes()
method as shown above).
Starting with a set of bytes which encode a string using UTF-8, creates a string from that data, then get some bytes encoding the string in a different encoding:
byte[] utf8bytes = { (byte)0xc3, (byte)0xa2, 0x61, 0x62, 0x63, 0x64 };
Charset utf8charset = Charset.forName("UTF-8");
Charset iso88591charset = Charset.forName("ISO-8859-1");
String string = new String ( utf8bytes, utf8charset );
System.out.println(string);
// "When I do a getbytes(encoding) and "
byte[] iso88591bytes = string.getBytes(iso88591charset);
for ( byte b : iso88591bytes )
System.out.printf("%02x ", b);
System.out.println();
// "then create a new string with the bytes in ISO-8859-1 encoding"
String string2 = new String ( iso88591bytes, iso88591charset );
// "I get a two different chars"
System.out.println(string2);
this outputs strings and the iso88591 bytes correctly:
âabcd
e2 61 62 63 64
âabcd
So your byte array wasn't paired with the correct encoding:
String failString = new String ( utf8bytes, iso88591charset );
System.out.println(failString);
Outputs
âabcd
(either that, or you just wrote the utf8 bytes to a file and read them elsewhere as iso88591)
This is what I needed:
public static byte[] encode(byte[] arr, String fromCharsetName) {
return encode(arr, Charset.forName(fromCharsetName), Charset.forName("UTF-8"));
}
public static byte[] encode(byte[] arr, String fromCharsetName, String targetCharsetName) {
return encode(arr, Charset.forName(fromCharsetName), Charset.forName(targetCharsetName));
}
public static byte[] encode(byte[] arr, Charset sourceCharset, Charset targetCharset) {
ByteBuffer inputBuffer = ByteBuffer.wrap( arr );
CharBuffer data = sourceCharset.decode(inputBuffer);
ByteBuffer outputBuffer = targetCharset.encode(data);
byte[] outputData = outputBuffer.array();
return outputData;
}
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With