Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How can I check whether a byte array contains a Unicode string in Java?

Given a byte array that is either a UTF-8 encoded string or arbitrary binary data, what approaches can be used in Java to determine which it is?

The array may be generated by code similar to:

byte[] utf8 = "Hello World".getBytes("UTF-8");

Alternatively it may have been generated by code similar to:

byte[] messageContent = new byte[256];
for (int i = 0; i < messageContent.length; i++) {
    messageContent[i] = (byte) i;
}

The key point is that we don't know what the array contains but need to find out in order to fill in the following function:

public final String getString(final byte[] dataToProcess) {
    // Determine whether dataToProcess contains arbitrary data or a UTF-8 encoded string
    // If dataToProcess contains arbitrary data then we will BASE64 encode it and return.
    // If dataToProcess contains an encoded string then we will decode it and return.
}

How would this be extended to also cover UTF-16 or other encoding mechanisms?

like image 668
Iain Avatar asked Jul 28 '09 10:07

Iain


People also ask

What is the difference between byte string and Unicode string?

A character in a str represents one Unicode character. However, to represent more than 256 characters, individual Unicode encodings use more than one byte per character to represent many characters. bytes objects give you access to the underlying bytes.

How do I check if a string is UTF-8?

Any ASCII string is a valid UTF-8 string. An ASCII character is simply a byte value in [0,127] or [0x00, 0x7F] in hexadecimal. That is, the most significant bit is always zero. However, there are many more unicode characters than can be represented using a single byte.

Is byte array same as string?

Since bytes is the binary data while String is character data. It is important to know the original encoding of the text from which the byte array has created. When we use a different character encoding, we do not get the original string back.

Does Java use UTF-8 or UTF-16?

The native character encoding of the Java programming language is UTF-16. A charset in the Java platform therefore defines a mapping between sequences of sixteen-bit UTF-16 code units (that is, sequences of chars) and sequences of bytes.


1 Answers

It's not possible to make that decision with full accuracy in all cases, because an UTF-8 encoded string is one kind of arbitrary binary data, but you can look for byte sequences that are invalid in UTF-8. If you find any, you know that it's not UTF-8.

If you array is large enough, this should work out well since it is very likely for such sequences to appear in "random" binary data such as compressed data or image files.

However, it is possible to get valid UTF-8 data that decodes to a totally nonsensical string of characters (probably from all kinds of diferent scripts). This is more likely with short sequences. If you're worried about that, you might have to do a closer analysis to see whether the characters that are letters all belong to the same code chart. Then again, this may yield false negatives when you have valid text input that mixes scripts.

like image 170
Michael Borgwardt Avatar answered Oct 03 '22 00:10

Michael Borgwardt