Implement a function to check if a string/byte array follows utf-8 format

Tags: java, string, utf-8

I am trying to solve this interview question.

After being given a clear definition of the UTF-8 format (for example, 1 byte: 0b0xxxxxxx, 2 bytes: ...), I was asked to write a function that validates whether the input is valid UTF-8. The input will be a string or byte array, and the output should be yes/no.

I have two possible approaches.

First, if the input is a string: since a UTF-8 sequence is at most 4 bytes, after removing the first two characters "0b" we can use Integer.parseInt(s) to check whether the rest of the string is in the range 0 to 10FFFF. It is also better to first check that the length of the string is a multiple of 8 and that the string contains only 0s and 1s. So I will have to go through the string twice, and the complexity will be O(n).

Second, if the input is a byte array (this method also works if the input is a string), we check whether each 1-byte element is in the correct range. If the input is a string, first check that its length is a multiple of 8, then check that each 8-character substring is in the range.
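For the string case, the two pre-checks described above (length is a multiple of 8, only 0s and 1s) can be folded into the conversion to a byte array, so the string is walked only once. The sketch below assumes the input is a run of binary digits with an optional "0b" prefix; the class name BitStringParser and the method bitsToBytes are made up for illustration and are not part of the interview question.

```java
import java.util.Optional;

class BitStringParser {

    // Hypothetical helper: turns "0b1100001110101001" (prefix optional) into bytes.
    // Returns Optional.empty() if the text is not a whole number of 8-bit groups
    // or contains anything other than '0' and '1'.
    public static Optional<byte[]> bitsToBytes(String text) {
        String bits = text.startsWith("0b") ? text.substring(2) : text;
        if (bits.length() % 8 != 0) {
            return Optional.empty();
        }
        byte[] out = new byte[bits.length() / 8];
        for (int i = 0; i < bits.length(); i++) {
            char c = bits.charAt(i);
            if (c != '0' && c != '1') {
                return Optional.empty();
            }
            // Shift the bits collected so far for this byte and append the new one.
            out[i / 8] = (byte) ((out[i / 8] << 1) | (c - '0'));
        }
        return Optional.of(out);
    }
}
```

The resulting byte array can then be handed to whatever byte-level validator is chosen, so the string and byte array inputs share one code path.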

I know there are a couple of solutions for checking a string using Java libraries, but my question is how I should implement the function based on the question.

Thanks a lot.


asked Mar 06 '15 01:03

DoraShine

2 Answers

Let's first have a look at a visual representation of the UTF-8 design.

[Image: visual representation of the UTF-8 design (not available)]

Now let's summarize what we have to do.

Loop over all characters of the string (each character being a byte).
We need to apply a mask to each byte, because the x bits carry the actual code point rather than the length information. We use the binary AND operator (&), which copies a bit to the result only if it is set in both operands.
The goal of the mask is to clear the trailing payload bits so that only the leading marker bits of the first byte are compared. The mask covers the leading 1 bits plus the 0 bit that terminates them; the expected result has as many leading 1s as the "Bytes in sequence" value (none for a single byte), and every other bit is 0.
We can then compare the masked first byte with the expected pattern to verify that it is valid and to determine how many bytes the sequence has.
If the byte matches none of the cases, it is invalid and we return "No".
If we get out of the loop, every character is valid, hence the string is valid.
Make sure the case that matched corresponds to the expected length: each following byte of the sequence must be a continuation byte of the form 0b10xxxxxx.

The method would look like this:

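public static final boolean isUTF8(final byte[] pText) {

    // Number of bytes in the sequence that starts at the current lead byte.
    int expectedLength = 0;

    for (int i = 0; i < pText.length; i++) {
        // Classify the lead byte by its marker bits.
        if ((pText[i] & 0b10000000) == 0b00000000) {
            expectedLength = 1;
        } else if ((pText[i] & 0b11100000) == 0b11000000) {
            expectedLength = 2;
        } else if ((pText[i] & 0b11110000) == 0b11100000) {
            expectedLength = 3;
        } else if ((pText[i] & 0b11111000) == 0b11110000) {
            expectedLength = 4;
        // 5- and 6-byte forms belong to the original UTF-8 design;
        // RFC 3629 no longer allows them.
        } else if ((pText[i] & 0b11111100) == 0b11111000) {
            expectedLength = 5;
        } else if ((pText[i] & 0b11111110) == 0b11111100) {
            expectedLength = 6;
        } else {
            // A continuation byte (10xxxxxx) or 0xFE/0xFF in lead position.
            return false;
        }

        // Every remaining byte of the sequence must look like 10xxxxxx.
        while (--expectedLength > 0) {
            if (++i >= pText.length) {
                // The input ends in the middle of a sequence.
                return false;
            }
            if ((pText[i] & 0b11000000) != 0b10000000) {
                return false;
            }
        }
    }

    return true;
}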

Edit: The current method is not quite my original one; it is taken from here. The original did not work properly, as @EJP pointed out in a comment.
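The method above only checks the lead-byte and continuation-byte bit patterns. If the definition handed out in the interview follows RFC 3629, a few more rules apply: sequences are at most 4 bytes long, overlong encodings are not allowed, and the decoded value must not be a surrogate (U+D800 to U+DFFF) or exceed U+10FFFF. The following is a minimal sketch under that assumption; the names StrictUtf8 and isStrictUtf8 are made up for illustration.

```java
class StrictUtf8 {

    // Returns true only if every sequence is well formed under the RFC 3629 rules.
    public static boolean isStrictUtf8(byte[] bytes) {
        int i = 0;
        while (i < bytes.length) {
            int lead = bytes[i] & 0xFF;            // treat the byte as unsigned
            int length;                            // total bytes in this sequence
            int codePoint;                         // payload bits decoded so far
            if (lead < 0x80) {                     // 0xxxxxxx: single byte
                i++;
                continue;
            } else if ((lead & 0xE0) == 0xC0) {    // 110xxxxx
                length = 2;
                codePoint = lead & 0x1F;
            } else if ((lead & 0xF0) == 0xE0) {    // 1110xxxx
                length = 3;
                codePoint = lead & 0x0F;
            } else if ((lead & 0xF8) == 0xF0) {    // 11110xxx
                length = 4;
                codePoint = lead & 0x07;
            } else {
                return false;                      // stray continuation byte or 5/6-byte lead
            }
            if (i + length > bytes.length) {
                return false;                      // input ends mid-sequence
            }
            for (int k = 1; k < length; k++) {
                int next = bytes[i + k] & 0xFF;
                if ((next & 0xC0) != 0x80) {
                    return false;                  // not of the form 10xxxxxx
                }
                codePoint = (codePoint << 6) | (next & 0x3F);
            }
            // Smallest code point that genuinely needs this many bytes.
            int min = length == 2 ? 0x80 : length == 3 ? 0x800 : 0x10000;
            if (codePoint < min) {
                return false;                      // overlong encoding
            }
            if (codePoint >= 0xD800 && codePoint <= 0xDFFF) {
                return false;                      // UTF-16 surrogate
            }
            if (codePoint > 0x10FFFF) {
                return false;                      // beyond the Unicode range
            }
            i += length;
        }
        return true;
    }
}
```

For example, the two bytes 0xC0 0x80 pass the bit-pattern check but are rejected here as an overlong encoding of U+0000.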

answered Sep 28 '22 01:09

Jean-François Savard

A small solution for real-world UTF-8 compatibility checking:

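import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Malformed input is decoded to U+FFFD, so only valid UTF-8 survives the round trip unchanged.
public static final boolean isUTF8(final byte[] inputBytes) {
    final String converted = new String(inputBytes, StandardCharsets.UTF_8);
    final byte[] outputBytes = converted.getBytes(StandardCharsets.UTF_8);
    return Arrays.equals(inputBytes, outputBytes);
}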

You can check the test results:

@Test
public void testEnconding() {

    byte[] invalidUTF8Bytes1 = new byte[]{(byte)0b10001111, (byte)0b10111111 };
    byte[] invalidUTF8Bytes2 = new byte[]{(byte)0b10101010, (byte)0b00111111 };
    byte[] validUTF8Bytes1 = new byte[]{(byte)0b11001111, (byte)0b10111111 };
    byte[] validUTF8Bytes2 = new byte[]{(byte)0b11101111, (byte)0b10101010, (byte)0b10111111 };

    assertThat(isUTF8(invalidUTF8Bytes1)).isFalse();
    assertThat(isUTF8(invalidUTF8Bytes2)).isFalse();
    assertThat(isUTF8(validUTF8Bytes1)).isTrue();
    assertThat(isUTF8(validUTF8Bytes2)).isTrue();
    assertThat(isUTF8("\u24b6".getBytes(StandardCharsets.UTF_8))).isTrue();
}

Test cases copied from https://codereview.stackexchange.com/questions/59428/validating-utf-8-byte-array
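A possible variation on the same library-based idea, not part of the answer above, is to ask a CharsetDecoder to report malformed input instead of replacing it. That avoids encoding the string back to bytes and comparing arrays. This is only a sketch; the names DecoderCheck and isUtf8ByDecoder are made up for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class DecoderCheck {

    // Decodes strictly: any malformed sequence raises CharacterCodingException.
    public static boolean isUtf8ByDecoder(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

A new decoder is created on every call because CharsetDecoder instances are not safe to share between threads.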

answered Sep 28 '22 02:09

Thiago Mata
