How can I check whether a byte array contains a Unicode string in Java?

Tags: java, regex, unicode, utf-8

Given a byte array that is either a UTF-8 encoded string or arbitrary binary data, what approaches can be used in Java to determine which it is?

The array may be generated by code similar to:

byte[] utf8 = "Hello World".getBytes("UTF-8");

Alternatively it may have been generated by code similar to:

byte[] messageContent = new byte[256];
for (int i = 0; i < messageContent.length; i++) {
    messageContent[i] = (byte) i;
}

The key point is that we don't know what the array contains but need to find out in order to fill in the following function:

public final String getString(final byte[] dataToProcess) {
    // Determine whether dataToProcess contains arbitrary data or a UTF-8 encoded string
    // If dataToProcess contains arbitrary data then we will BASE64 encode it and return.
    // If dataToProcess contains an encoded string then we will decode it and return.
}

How would this be extended to also cover UTF-16 or other encoding mechanisms?

asked Jul 28 '09 10:07

Iain

1 Answer

It's not possible to make that decision with full accuracy in all cases, because a UTF-8 encoded string is itself one kind of arbitrary binary data. You can, however, look for byte sequences that are invalid in UTF-8. If you find any, you know that the data is not UTF-8.

If your array is large enough, this should work out well, since such sequences are very likely to appear in "random" binary data such as compressed data or image files.
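One way to sketch this check in Java is with a strict CharsetDecoder that reports malformed input instead of silently replacing it. The class name Utf8OrBinary is made up for illustration, and the sketch assumes Java 8 or later for java.util.Base64. The method mirrors the getString stub from the question.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper class; the name is not from the original post.
final class Utf8OrBinary {

    /** Returns the decoded text if the data is well-formed UTF-8, otherwise its Base64 form. */
    static String getString(final byte[] dataToProcess) {
        // Create a fresh decoder per call: CharsetDecoder instances are not thread-safe.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            // With REPORT, decode() throws on the first invalid byte sequence.
            return decoder.decode(ByteBuffer.wrap(dataToProcess)).toString();
        } catch (CharacterCodingException e) {
            // Not valid UTF-8, so treat the array as arbitrary binary data.
            return Base64.getEncoder().encodeToString(dataToProcess);
        }
    }
}
```

Note that new String(bytes, "UTF-8") cannot be used for this test, because it replaces invalid sequences with U+FFFD rather than failing. The same approach carries over to UTF-16 only weakly: almost any even-length byte sequence is well-formed UTF-16 (only unpaired surrogates are rejected), so a strict decoder rules out far less binary data there.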

However, it is possible to get valid UTF-8 data that decodes to a totally nonsensical string of characters (probably from all kinds of different scripts). This is more likely with short sequences. If you're worried about that, you might have to do a closer analysis to see whether the characters that are letters all belong to the same code chart. Then again, this may yield false negatives when you have valid text input that mixes scripts.
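The closer analysis could look roughly like the following. This is only a heuristic sketch: it uses Character.UnicodeScript (Java 7+) as a stand-in for "code chart", and the class and method names are invented for illustration.

```java
import java.lang.Character.UnicodeScript;
import java.util.EnumSet;
import java.util.Set;

// Hypothetical helper class; the name is not from the original post.
final class ScriptCheck {

    /** Collects the Unicode scripts of all letters in the text. */
    static Set<UnicodeScript> letterScripts(String text) {
        Set<UnicodeScript> scripts = EnumSet.noneOf(UnicodeScript.class);
        // Iterate by code point so characters outside the BMP are handled correctly.
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            if (Character.isLetter(codePoint)) {
                scripts.add(UnicodeScript.of(codePoint));
            }
            i += Character.charCount(codePoint);
        }
        return scripts;
    }

    /** True if the letters in the text come from at most one script. */
    static boolean looksLikeSingleScript(String text) {
        return letterScripts(text).size() <= 1;
    }
}
```

As the caveat above suggests, a strict single-script rule rejects legitimate mixed text. Japanese, for example, routinely combines Han, Hiragana and Katakana, so in practice a small allowed number of scripts may be a better threshold than exactly one.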

answered Oct 03 '22 00:10

Michael Borgwardt
