The following test fails on converted Latin1, because illegal characters are replaced with byte with the value 63 (question mark). The problem is that these characters should better cause some exception ...
@Test
public void testEncoding() throws UnsupportedEncodingException {
final String czech = "Řízeček a šampáňo a žízeň";
// okay
final byte[] bytesInLatin2 = czech.getBytes("ISO8859-2");
// different bytes, but okay
final byte[] bytesInWin1250 = czech.getBytes("Windows-1250");
// different bytes, but okay
final byte[] bytesInUtf8 = czech.getBytes("UTF-8");
// nonsense; Ř,č,... are not in Latin1 code set!!!
final byte[] bytesInLatin1 = czech.getBytes("ISO8859-1");
System.out.println(Arrays.toString(bytesInLatin2));
System.out.println(Arrays.toString(bytesInWin1250));
System.out.println(Arrays.toString(bytesInUtf8));
System.out.println(Arrays.toString(bytesInLatin1));
System.out.flush();
final String latin2 = new String(bytesInLatin2, "ISO8859-2");
final String win1250 = new String(bytesInWin1250, "Windows-1250");
final String utf8 = new String(bytesInUtf8, "UTF-8");
final String latin1 = new String(bytesInLatin1, "ISO8859-1");
Assert.assertEquals("latin2", czech, latin2);
Assert.assertEquals("win1250", czech, win1250);
Assert.assertEquals("utf8", czech, utf8);
Assert.assertEquals("latin1", czech, latin1); // this test will fail!
}
There are many situations where the data are finally corrupted because of this behaviour of Java. Is there any library available to validate Strings if they are encodable with some encoding?
In PHP, the mb_check_encoding() function is used to check if the given strings are valid for the specified encoding. This function checks if the specified byte stream is valid for the specified encoding.
Valid UTF8 has a specific binary format. If it's a single byte UTF8 character, then it is always of form '0xxxxxxx', where 'x' is any binary digit. If it's a two byte UTF8 character, then it's always of form '110xxxxx10xxxxxx'.
Encoding is a way to convert data from one format to another. String objects use UTF-16 encoding.
String objects in Java are encoded in UTF-16. Java Platform is required to support other character encodings or charsets such as US-ASCII, ISO-8859-1, and UTF-8. Errors may occur when converting between differently coded character data.
I suspect you're looking for CharsetEncoder.canEncode(CharSequence)
.
Charset latin2 = Charset.forName("ISO8859-2");
boolean validInLatin2 = latin2.newEncoder().canEncode(czech);
...
As an alternative to Jon Skeet's suggestion, you can also use CharsetEncoder class to do the encoding directly (with the encode method), but first call the onMalformedInput and onUnmappableCharacter methods to specify what the encoder should do when it encounters bad input.
That way most of the time you're just doing a simple encode call, but if anything goes wrong you'll get an exception.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With