Given a byte array that is either a UTF-8 encoded string or arbitrary binary data, what approaches can be used in Java to determine which it is?
The array may be generated by code similar to:
byte[] utf8 = "Hello World".getBytes("UTF-8");
Alternatively it may have been generated by code similar to:
byte[] messageContent = new byte[256];
for (int i = 0; i < messageContent.length; i++) {
messageContent[i] = (byte) i;
}
The key point is that we don't know what the array contains but need to find out in order to fill in the following function:
public final String getString(final byte[] dataToProcess) {
// Determine whether dataToProcess contains arbitrary data or a UTF-8 encoded string
// If dataToProcess contains arbitrary data then we will BASE64 encode it and return.
// If dataToProcess contains an encoded string then we will decode it and return.
}
How would this be extended to also cover UTF-16 or other encoding mechanisms?
A character in a str represents one Unicode character. However, to represent more than 256 characters, individual Unicode encodings use more than one byte per character to represent many characters. bytes objects give you access to the underlying bytes.
Any ASCII string is a valid UTF-8 string. An ASCII character is simply a byte value in [0,127] or [0x00, 0x7F] in hexadecimal. That is, the most significant bit is always zero. However, there are many more unicode characters than can be represented using a single byte.
Since bytes is the binary data while String is character data. It is important to know the original encoding of the text from which the byte array has created. When we use a different character encoding, we do not get the original string back.
The native character encoding of the Java programming language is UTF-16. A charset in the Java platform therefore defines a mapping between sequences of sixteen-bit UTF-16 code units (that is, sequences of chars) and sequences of bytes.
It's not possible to make that decision with full accuracy in all cases, because an UTF-8 encoded string is one kind of arbitrary binary data, but you can look for byte sequences that are invalid in UTF-8. If you find any, you know that it's not UTF-8.
If you array is large enough, this should work out well since it is very likely for such sequences to appear in "random" binary data such as compressed data or image files.
However, it is possible to get valid UTF-8 data that decodes to a totally nonsensical string of characters (probably from all kinds of diferent scripts). This is more likely with short sequences. If you're worried about that, you might have to do a closer analysis to see whether the characters that are letters all belong to the same code chart. Then again, this may yield false negatives when you have valid text input that mixes scripts.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With