How can I check if a String is encodable in some encoding?

The following test fails for the Latin1 conversion, because characters that cannot be encoded are silently replaced with a byte of value 63 (a question mark). I would rather have such characters cause an exception.

  @Test
  public void testEncoding() throws UnsupportedEncodingException {
    final String czech = "Řízeček a šampáňo a žízeň";
    // okay
    final byte[] bytesInLatin2 = czech.getBytes("ISO8859-2");
    // different bytes, but okay
    final byte[] bytesInWin1250 = czech.getBytes("Windows-1250");
    // different bytes, but okay
    final byte[] bytesInUtf8 = czech.getBytes("UTF-8");
    // nonsense; Ř,č,... are not in Latin1 code set!!!
    final byte[] bytesInLatin1 = czech.getBytes("ISO8859-1");

    System.out.println(Arrays.toString(bytesInLatin2));
    System.out.println(Arrays.toString(bytesInWin1250));
    System.out.println(Arrays.toString(bytesInUtf8));
    System.out.println(Arrays.toString(bytesInLatin1));
    System.out.flush();

    final String latin2 = new String(bytesInLatin2, "ISO8859-2");
    final String win1250 = new String(bytesInWin1250, "Windows-1250");
    final String utf8 = new String(bytesInUtf8, "UTF-8");
    final String latin1 = new String(bytesInLatin1, "ISO8859-1");

    Assert.assertEquals("latin2", czech, latin2);
    Assert.assertEquals("win1250", czech, win1250);
    Assert.assertEquals("utf8", czech, utf8);
    Assert.assertEquals("latin1", czech, latin1); // this test will fail!
  }

There are many situations where data ends up corrupted because of this behaviour of Java. Is there a library that can check whether a String is encodable in a given encoding?

dmatej asked Jun 03 '13 17:06


2 Answers

I suspect you're looking for CharsetEncoder.canEncode(CharSequence).

  Charset latin2 = Charset.forName("ISO8859-2");
  boolean validInLatin2 = latin2.newEncoder().canEncode(czech);
  ...
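
For illustration, here is a minimal standalone sketch of that check. The class name EncodableCheck and the helper isEncodable are made up for this example and are not part of any library; the sample word is "Řízeček" from the question, written with Unicode escapes so the source file stays pure ASCII.

```java
import java.nio.charset.Charset;

class EncodableCheck {
    // Hypothetical helper: returns true if every character of the text
    // can be represented in the named charset.
    static boolean isEncodable(String text, String charsetName) {
        // Encoders are stateful and not thread-safe, so create a fresh one per call.
        return Charset.forName(charsetName).newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        // The first word of the question's test string, as Unicode escapes.
        String word = "\u0158\u00edze\u010dek";

        System.out.println(isEncodable(word, "UTF-8"));      // prints "true"
        System.out.println(isEncodable(word, "ISO-8859-1")); // prints "false"
    }
}
```

Calling such a check before getBytes lets the caller decide what to do with unencodable text instead of silently writing question marks.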
Jon Skeet answered Oct 29 '22 13:10


As an alternative to Jon Skeet's suggestion, you can also use the CharsetEncoder class to do the encoding directly with its encode method. Before encoding, call the onMalformedInput and onUnmappableCharacter methods to specify what the encoder should do when it encounters bad input (CodingErrorAction.REPORT makes it throw).

That way most of the time you're just doing a simple encode call, but if anything goes wrong you'll get an exception.
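
A rough sketch of how that could look. The class name StrictEncoder and the helper encodeStrict are invented for this example; the sample word is "Řízeček" from the question, written with Unicode escapes so the source file stays pure ASCII.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;

class StrictEncoder {
    // Hypothetical helper: encodes the text and throws a CharacterCodingException
    // instead of substituting '?' for characters the charset cannot represent.
    static byte[] encodeStrict(String text, String charsetName)
            throws CharacterCodingException {
        CharsetEncoder encoder = Charset.forName(charsetName).newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);

        ByteBuffer buffer = encoder.encode(CharBuffer.wrap(text));

        // Copy only the bytes that were actually written.
        byte[] bytes = new byte[buffer.remaining()];
        buffer.get(bytes);
        return bytes;
    }

    public static void main(String[] args) {
        String word = "\u0158\u00edze\u010dek";
        try {
            System.out.println(encodeStrict(word, "UTF-8").length); // prints "10"
            encodeStrict(word, "ISO-8859-1"); // throws: the word is not representable in Latin1
        } catch (CharacterCodingException e) {
            System.out.println("not encodable: " + e);
        }
    }
}
```

The exception is a checked one, so callers are forced to handle the failure case rather than discover corrupted data later.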

James Holderness answered Oct 29 '22 12:10