What's the best way to identify whether a string is (or might be) UTF-8 encoded? The Win32 API IsTextUnicode isn't of much help here. Also, the string will not have a UTF-8 BOM, so that cannot be checked for. And, yes, I know that only characters above the ASCII range are encoded with more than one byte.
Valid UTF-8 has a specific binary format. A single-byte UTF-8 character is always of the form '0xxxxxxx', where 'x' is any binary digit. A two-byte UTF-8 character is always of the form '110xxxxx 10xxxxxx'. Likewise, a three-byte character starts with '1110xxxx' and a four-byte character with '11110xxx', each lead byte followed by continuation bytes of the form '10xxxxxx'.
If our byte is positive (8th bit set to 0), it's an ASCII character; this test assumes a language with signed bytes, such as Java: if ( myByte >= 0 ) return myByte; Codes greater than 127 are encoded into several bytes. On the other hand, if our byte is negative, it's probably part of a UTF-8 encoded character whose code is greater than 127.
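Putting those bit patterns together, a validity check just walks the bytes, classifies each lead byte, and verifies the expected number of '10xxxxxx' continuation bytes. Here is a minimal C++ sketch (the function name is mine, and it checks structure only; it does not reject overlong encodings, UTF-16 surrogates, or code points above U+10FFFF):

#include <cstddef>

// Structural UTF-8 check based on the bit patterns described above.
bool isLikelyUtf8(const unsigned char* s, std::size_t len)
{
    std::size_t i = 0;
    while (i < len)
    {
        unsigned char b = s[i];
        std::size_t extra;                        // continuation bytes expected
        if      (b < 0x80)           extra = 0;   // 0xxxxxxx: ASCII
        else if ((b & 0xE0) == 0xC0) extra = 1;   // 110xxxxx: 2-byte lead
        else if ((b & 0xF0) == 0xE0) extra = 2;   // 1110xxxx: 3-byte lead
        else if ((b & 0xF8) == 0xF0) extra = 3;   // 11110xxx: 4-byte lead
        else return false;                        // stray 10xxxxxx or invalid 0xF8+
        if (i + extra >= len)
            return false;                         // sequence truncated at end of input
        for (std::size_t k = 1; k <= extra; ++k)
            if ((s[i + k] & 0xC0) != 0x80)
                return false;                     // not a 10xxxxxx continuation byte
        i += extra + 1;
    }
    return true;
}

Note that bytes of the form '10xxxxxx' may only ever appear as continuations, which is why a stray one fails immediately.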
$ iconv -f UTF-8 your_file > /dev/null; echo $?
The command will return 0 if the file could be converted successfully, and 1 if not. Additionally, it will print out the byte offset where the first invalid byte sequence occurred. Edit: The output encoding doesn't have to be specified; it will be assumed to be UTF-8.
UTF-8 encodes a character as a sequence of one, two, three, or four bytes, while UTF-16 encodes a character as either two or four bytes. The numbers in the names refer to the size of the code unit: in UTF-8 the smallest representation of a character is one byte (eight bits), whereas in UTF-16 it is one 16-bit unit.
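As a concrete illustration, the euro sign U+20AC occupies three bytes in UTF-8 (E2 82 AC) but a single two-byte code unit in UTF-16. A tiny C++ snippet makes the sizes visible:

#include <iostream>
#include <string>

int main()
{
    std::string    utf8  = "\xE2\x82\xAC"; // U+20AC hand-encoded as UTF-8
    std::u16string utf16 = u"\u20AC";      // the same character in UTF-16

    std::cout << "UTF-8 bytes:  " << utf8.size() << '\n';                     // 3
    std::cout << "UTF-16 bytes: " << utf16.size() * sizeof(char16_t) << '\n'; // 2
}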
chardet: character set detection developed by Mozilla and used in Firefox. Source code
jchardet: a Java port of the source from Mozilla's automatic charset detection algorithm.
NCharDet: a .NET (C#) port of a Java port of the C++ detector used in the Mozilla and Firefox browsers.
A Code Project C# sample that uses Microsoft's MLang for character encoding detection; a minimal C++ sketch of calling MLang directly appears just after this list.
UTRAC: a command-line tool and library written in C++ to detect string encoding.
cpdetector: a Java project for encoding detection.
chsdet: a Delphi project; a stand-alone executable module for automatic charset/encoding detection of a given text or file.
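Since the question mentions Win32, here is a rough, untested C++ sketch of calling the same MLang detection that the Code Project C# sample above wraps; it uses the documented IMultiLanguage2::DetectInputCodepage COM method (error handling trimmed for brevity, link against ole32.lib and uuid.lib):

#include <windows.h>
#include <mlang.h>
#include <cstdio>

int main()
{
    CoInitialize(nullptr);

    IMultiLanguage2* mlang = nullptr;
    if (SUCCEEDED(CoCreateInstance(CLSID_CMultiLanguage, nullptr,
                                   CLSCTX_INPROC_SERVER,
                                   IID_IMultiLanguage2,
                                   reinterpret_cast<void**>(&mlang))))
    {
        char sample[] = "Some text whose encoding we want to guess";
        INT  size     = static_cast<INT>(sizeof(sample) - 1);

        DetectEncodingInfo info[4] = {};
        INT scores = 4;   // in: array capacity; out: number of guesses returned
        if (SUCCEEDED(mlang->DetectInputCodepage(MLDETECTCP_NONE, 0,
                                                 sample, &size,
                                                 info, &scores)))
        {
            // Each guess carries a code page and a confidence score.
            for (INT i = 0; i < scores; ++i)
                std::printf("codepage %u, confidence %d\n",
                            info[i].nCodePage, info[i].nConfidence);
        }
        mlang->Release();
    }
    CoUninitialize();
    return 0;
}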
Another useful post that points to many libraries for character encoding detection: http://fredeaker.blogspot.com/2007/01/character-encoding-detection.html
You could also take a look at the related question How Can I Best Guess the Encoding when the BOM (Byte Order Mark) is Missing?; it has some useful content.
There is no fully reliable way, but because a random sequence of bytes (e.g., a string in some standard 8-bit encoding) is very unlikely to be valid UTF-8 (once the most significant bit of a byte is set, very specific rules govern which bytes may follow it in UTF-8), you can try decoding the string as UTF-8 and consider it UTF-8 if there are no decoding errors.
Determining if there were decoding errors is another problem altogether: many Unicode libraries simply replace invalid sequences with a question mark (or U+FFFD) without indicating whether an error occurred. So you need an explicit way of determining whether decoding failed.
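On Windows (matching the question's Win32 context), one concrete way to get an explicit error instead of silent substitution is to pass MB_ERR_INVALID_CHARS to MultiByteToWideChar, which makes the call fail outright on invalid input. A small sketch:

#include <windows.h>

// Returns true only if the whole buffer decodes as valid UTF-8.
// With MB_ERR_INVALID_CHARS the call returns 0 (instead of inserting
// replacement characters) as soon as it hits an invalid byte sequence.
bool DecodesAsUtf8(const char* data, int length)
{
    int units = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                    data, length, nullptr, 0);
    return units != 0;
}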
This W3C page (https://www.w3.org/International/questions/qa-forms-utf-8) has a Perl regular expression for validating UTF-8.
You didn't specify a language, but in PHP you can use mb_check_encoding:
if (mb_check_encoding($yourString, 'UTF-8'))
{
    // the string is UTF-8
}
else
{
    // the string is not UTF-8
}