I am trying to solve this interview question.
After given clearly definition of UTF-8 format. ex: 1-byte : 0b0xxxxxxx 2- bytes:.... Asked to write a function to validate whether the input is valid UTF-8. Input will be string/byte array, output should be yes/no.
I have two possible approaches.
First, if the input is a string, since UTF-8 is at most 4 byte, after we remove the first two characters "0b", we can use Integer.parseInt(s) to check if the rest of the string is at the range 0 to 10FFFF. Moreover, it is better to check if the length of the string is a multiple of 8 and if the input string contains all 0s and 1s first. So I will have to go through the string twice and the complexity will be O(n).
Second, if the input is a byte array (we can also use this method if the input is a string), we check if each 1-byte element is in the correct range. If the input is a string, first check the length of the string is a multiple of 8 then check each 8-character substring is in the range.
I know there are couple solutions on how to check a string using Java libraries, but my question is how I should implement the function based on the question.
Thanks a lot.
How can I check if a string is in valid UTF-8 format? you mean byte[] is validly encoded? The simplest thing to do might be to decode it and encode it again. Check you get the same thing.
UTF-8 is based on 8-bit code units. Each character is encoded as 1 to 4 bytes.
In order to convert a String into UTF-8, we use the getBytes() method in Java. The getBytes() method encodes a String into a sequence of bytes and returns a byte array. where charsetName is the specific charset by which the String is encoded into an array of bytes.
UTF-8 encodes a character into a binary string of one, two, three, or four bytes. UTF-16 encodes a Unicode character into a string of either two or four bytes. This distinction is evident from their names.
Let's first have a look at a visual representation of the UTF-8 design.
Now let's resume what we have to do.
x
characters represent the actual codepoint. We will use the binary AND operator (&
) which copy a bit to the result if it exists in both operands.0b1xxxxxxx
where 1 will appear "Bytes in sequence" time, and other bits will be 0.The method would look like this :
public static final boolean isUTF8(final byte[] pText) {
int expectedLength = 0;
for (int i = 0; i < pText.length; i++) {
if ((pText[i] & 0b10000000) == 0b00000000) {
expectedLength = 1;
} else if ((pText[i] & 0b11100000) == 0b11000000) {
expectedLength = 2;
} else if ((pText[i] & 0b11110000) == 0b11100000) {
expectedLength = 3;
} else if ((pText[i] & 0b11111000) == 0b11110000) {
expectedLength = 4;
} else if ((pText[i] & 0b11111100) == 0b11111000) {
expectedLength = 5;
} else if ((pText[i] & 0b11111110) == 0b11111100) {
expectedLength = 6;
} else {
return false;
}
while (--expectedLength > 0) {
if (++i >= pText.length) {
return false;
}
if ((pText[i] & 0b11000000) != 0b10000000) {
return false;
}
}
}
return true;
}
Edit : The actual method is not the original one (almost, but not) and is stolen from here. The original one was not properly working as per @EJP comment.
A small solution for real world UTF-8 compatibility checking:
public static final boolean isUTF8(final byte[] inputBytes) {
final String converted = new String(inputBytes, StandardCharsets.UTF_8);
final byte[] outputBytes = converted.getBytes(StandardCharsets.UTF_8);
return Arrays.equals(inputBytes, outputBytes);
}
You can check the tests results:
@Test
public void testEnconding() {
byte[] invalidUTF8Bytes1 = new byte[]{(byte)0b10001111, (byte)0b10111111 };
byte[] invalidUTF8Bytes2 = new byte[]{(byte)0b10101010, (byte)0b00111111 };
byte[] validUTF8Bytes1 = new byte[]{(byte)0b11001111, (byte)0b10111111 };
byte[] validUTF8Bytes2 = new byte[]{(byte)0b11101111, (byte)0b10101010, (byte)0b10111111 };
assertThat(isUTF8(invalidUTF8Bytes1)).isFalse();
assertThat(isUTF8(invalidUTF8Bytes2)).isFalse();
assertThat(isUTF8(validUTF8Bytes1)).isTrue();
assertThat(isUTF8(validUTF8Bytes2)).isTrue();
assertThat(isUTF8("\u24b6".getBytes(StandardCharsets.UTF_8))).isTrue();
}
Test cases copy from https://codereview.stackexchange.com/questions/59428/validating-utf-8-byte-array
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With