 

Verifying a string is UTF-8 encoded in Java

There are plenty of examples of how to check whether a string is UTF-8 encoded, for example:

public static boolean isUTF8(String s) {
    try {
        byte[] bytes = s.getBytes("UTF-8");
    } catch (UnsupportedEncodingException e) {
        e.printStackTrace();
        System.exit(-1);
    }
    return true;
}

The doc of java.lang.String#getBytes(java.nio.charset.Charset) says:

This method always replaces malformed-input and unmappable-character sequences with this charset's default replacement byte array.

  1. Is it correct that it always returns correct UTF-8 bytes?
  2. Does it make sense to perform such checks on String objects at all? Won't it always return true, since a String object is already encoded?
  3. As far as I understand such checks should be performed on bytes, not on String objects:
public static final boolean isUTF8(final byte[] inputBytes) {
    final String converted = new String(inputBytes, StandardCharsets.UTF_8);
    final byte[] outputBytes = converted.getBytes(StandardCharsets.UTF_8);
    return Arrays.equals(inputBytes, outputBytes);
}

But in this case I'm not sure I understand where I should take those bytes from, as getting them straight from the String object would not be correct.

void asked Oct 23 '25


1 Answer

Is it correct that it always returns correct UTF-8 bytes?

Yes.

Does it make sense to perform such checks on String objects at all? Won't it always return true, since a String object is already encoded?

Java strings use Unicode characters encoded in UTF-16. Since UTF-16 uses surrogate pairs, any unpaired surrogate is invalid, so Java strings can contain invalid char sequences.

Java strings can also contain characters that are unassigned in Unicode.

Which means that performing validation on a Java String makes sense, though it is very rarely done.
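Such a check on a String can be sketched with a CharsetEncoder. This is my own illustration (the class and method names are invented): encoding with CodingErrorAction.REPORT makes unpaired surrogates raise a checked CharacterCodingException instead of being silently replaced.

```java
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StringValidator {

    // Returns true if the string contains no unpaired surrogates,
    // i.e. it can be encoded to UTF-8 without replacement characters.
    // REPORT is already the default for a fresh encoder; it is set
    // explicitly here for clarity.
    static boolean isWellFormed(String s) {
        try {
            StandardCharsets.UTF_8.newEncoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .encode(CharBuffer.wrap(s));
            return true;
        } catch (CharacterCodingException e) {
            return false; // e.g. MalformedInputException for "\uD800"
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("hello"));  // true
        System.out.println(isWellFormed("\uD800")); // false: unpaired surrogate
    }
}
```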

As far as I understand such checks should be performed on bytes, not on String objects.

Depending on the character set of the bytes, there is nothing to validate, e.g. character set CP437 maps all 256 byte values, so it cannot be invalid.

UTF-8, by contrast, has byte sequences that are invalid, so you're correct that validating bytes is useful.
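Instead of the round-trip comparison in the question, a byte-level check can lean on a CharsetDecoder directly. A minimal sketch (class and method names are my own; the convenience method CharsetDecoder.decode(ByteBuffer) throws on malformed input because REPORT is the default error action):

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;

public class Utf8ByteValidator {

    // Strictly decodes the bytes as UTF-8; any malformed sequence
    // (truncated multi-byte sequence, overlong encoding, stray
    // continuation byte, ...) raises CharacterCodingException.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder().decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] euro = {(byte) 0xE2, (byte) 0x82, (byte) 0xAC}; // "€", valid UTF-8
        byte[] bad  = {(byte) 0xC0, (byte) 0xAF};              // overlong encoding, invalid
        System.out.println(isValidUtf8(euro)); // true
        System.out.println(isValidUtf8(bad));  // false
    }
}
```

This avoids allocating the second byte array that the round-trip approach needs, and it answers the "where do the bytes come from" question: they come from outside the JVM (a file, a socket, a database column), before you ever construct a String from them.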


As the javadoc says, getBytes(Charset) always replaces malformed-input and unmappable-character sequences with the charset's default replacement byte array.

That is because it does this:

CharsetEncoder encoder = charset.newEncoder()
        .onMalformedInput(CodingErrorAction.REPLACE)
        .onUnmappableCharacter(CodingErrorAction.REPLACE);

If you want to get the bytes, but fail on malformed-input and unmappable-character sequences, use CodingErrorAction.REPORT instead. Since that's actually the default, simply don't call the two onXxx() methods.

Example

String s = "\uD800"; // unpaired surrogate
System.out.println(Arrays.toString(s.getBytes(StandardCharsets.UTF_8)));

That prints [63] which is a ?, i.e. the unpaired surrogate is malformed-input, so it was replaced with the replacement byte.

String s = "\uD800"; // unpaired surrogate

CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder();
ByteBuffer encoded = encoder.encode(CharBuffer.wrap(s.toCharArray()));
byte[] bytes = new byte[encoded.remaining()];
encoded.get(bytes);

System.out.println(Arrays.toString(bytes));

That throws MalformedInputException: Input length = 1, since the default malformed-input action is REPORT. Note that encode() declares the checked CharacterCodingException, so the calling code must catch it or declare it.

Andreas answered Oct 25 '25


