How to identify UTF-8 encoded strings

What's the best way to identify whether a string is (or might be) UTF-8 encoded? The Win32 API function IsTextUnicode isn't of much help here. Also, the string will not have a UTF-8 BOM, so that cannot be checked for. And yes, I know that only characters above the ASCII range are encoded with more than 1 byte.

asked Dec 18 '08 09:12

Johann Gerell

4 Answers

chardet is the character set detection library developed by Mozilla and used in Firefox (source code available).

jchardet is a Java port of the source of Mozilla's automatic charset detection algorithm.

NCharDet is a .NET (C#) port of a Java port of the C++ code used in the Mozilla and Firefox browsers.

A Code Project C# sample uses Microsoft's MLang for character encoding detection.

UTRAC is a command-line tool and library written in C++ to detect string encoding.

cpdetector is a Java project used for encoding detection.

chsdet is a Delphi project: a stand-alone executable module for automatic charset/encoding detection of a given text or file.

Another useful post points to a lot of libraries that help you determine character encoding: http://fredeaker.blogspot.com/2007/01/character-encoding-detection.html

You could also take a look at the related question "How Can I Best Guess the Encoding when the BOM (Byte Order Mark) is Missing?", which has some useful content.

answered Oct 12 '22 03:10

Edward Wilde

There is no really reliable way, but a random sequence of bytes (e.g. a string in a standard 8-bit encoding) is very unlikely to be a valid UTF-8 string: if the most significant bit of a byte is set, there are very specific rules as to which bytes can follow it in UTF-8. So you can try decoding the string as UTF-8 and consider it UTF-8 if there are no decoding errors.

Determining whether there were decoding errors is another problem altogether: many Unicode libraries simply replace invalid characters with a question mark without indicating whether an error occurred. So you need an explicit way of determining whether an error occurred while decoding.
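As a rough illustration of the "try to decode and watch for errors" idea, here is a minimal Python sketch. The function name looks_like_utf8 is made up for this example; the only real API used is bytes.decode with its strict error handler, which raises UnicodeDecodeError instead of silently substituting a replacement character.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if data decodes as UTF-8 without any error.

    Hypothetical helper for illustration. Pure ASCII input also
    returns True, because ASCII is a subset of UTF-8.
    """
    try:
        # "strict" makes the decoder raise on the first invalid sequence.
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


utf8_bytes = "héllo".encode("utf-8")      # b'h\xc3\xa9llo'
latin1_bytes = "héllo".encode("latin-1")  # b'h\xe9llo'

print(looks_like_utf8(utf8_bytes))    # True
print(looks_like_utf8(latin1_bytes))  # False

# A lenient decoder hides the problem: the invalid byte is replaced
# with U+FFFD and no exception is raised.
print(latin1_bytes.decode("utf-8", errors="replace") == "h\ufffdllo")  # True
```

A True result only means the bytes are valid UTF-8, not that the author intended them as UTF-8; short non-ASCII strings in other encodings can occasionally pass by coincidence.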

answered Oct 12 '22 02:10

Laurent

This W3C page has a Perl regular expression for validating UTF-8.
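For readers who cannot follow the link, the regular-expression approach can be sketched in Python as below. This is not copied from the W3C page; it is a pattern written from the well-formed UTF-8 byte sequence ranges in the Unicode Standard, and it accepts the full ASCII range including control characters. The names UTF8_PATTERN and is_utf8_regex are invented for this example.

```python
import re

# Each alternative matches exactly one well-formed UTF-8 encoded code point.
UTF8_PATTERN = re.compile(
    rb"""(?:
        [\x00-\x7F]                        # ASCII
      | [\xC2-\xDF][\x80-\xBF]             # 2-byte, non-overlong
      | \xE0[\xA0-\xBF][\x80-\xBF]         # 3-byte, excluding overlongs
      | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
      | \xED[\x80-\x9F][\x80-\xBF]         # 3-byte, excluding surrogates
      | \xF0[\x90-\xBF][\x80-\xBF]{2}      # 4-byte, planes 1-3
      | [\xF1-\xF3][\x80-\xBF]{3}          # 4-byte, planes 4-15
      | \xF4[\x80-\x8F][\x80-\xBF]{2}      # 4-byte, plane 16
    )*""",
    re.VERBOSE,
)


def is_utf8_regex(data: bytes) -> bool:
    """Return True if the whole byte string is well-formed UTF-8."""
    return UTF8_PATTERN.fullmatch(data) is not None


print(is_utf8_regex("naïve €".encode("utf-8")))  # True
print(is_utf8_regex(b"\xc0\xaf"))                # False (overlong encoding)
print(is_utf8_regex(b"\xed\xa0\x80"))            # False (UTF-16 surrogate)
```

In a language whose decoder cannot report errors, a pattern like this gives an explicit yes/no answer without relying on the decoder at all.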

answered Oct 12 '22 02:10

hamishmcn

You didn't specify a language, but in PHP you can use mb_check_encoding:

   if (mb_check_encoding($yourString, 'UTF-8'))
   {
       // the string is valid UTF-8
   }
   else
   {
       // the string is not valid UTF-8
   }

answered Oct 12 '22 03:10

Ryan

Tags:

encoding

unicode

utf-8