I found a useful function in another answer, and I wonder if someone could explain what it is doing and whether it is reliable. I was using mb_detect_encoding(), but it gave the wrong result when reading from an ISO 8859-1 file on a Linux OS.
This function seems to work in all cases I tested.
Here is the question: Get file encoding
Here is the function:
function isUTF8($string){
    return preg_match('%(?:
    [\xC2-\xDF][\x80-\xBF]              # Non-overlong 2-byte
    |\xE0[\xA0-\xBF][\x80-\xBF]         # Excluding overlongs
    |[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # Straight 3-byte
    |\xED[\x80-\x9F][\x80-\xBF]         # Excluding surrogates
    |\xF0[\x90-\xBF][\x80-\xBF]{2}      # Planes 1-3
    |[\xF1-\xF3][\x80-\xBF]{3}          # Planes 4-15
    |\xF4[\x80-\x8F][\x80-\xBF]{2}      # Plane 16
    )+%xs', $string);
}
Is this a reliable way of detecting UTF-8 strings? What exactly is it doing? Can it be made more robust?
is_utf8() – check for UTF-8
With this PHP function it is possible to check whether a string is encoded as UTF-8, or at least appears to be. It scans the string for invalid UTF-8 byte sequences and returns false if it finds any.
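A whole-string check of the kind described here can be sketched by anchoring the pattern from the question. As quoted, isUTF8() is unanchored, so it reports a match as soon as one valid multi-byte sequence appears anywhere: a pure ASCII string yields 0, and a string that mixes valid and invalid bytes yields 1. The variant below is only a sketch under stated assumptions: the name isValidUtf8Strict is hypothetical, an ASCII branch has been added, and the pattern is anchored with \A and \z so every byte must belong to a well-formed sequence.

```php
<?php
// Hypothetical helper, not from the original answer: returns true only if
// the ENTIRE byte string is well-formed UTF-8 (the empty string counts as valid).
function isValidUtf8Strict(string $bytes): bool
{
    $pattern = '%\A(?:
          [\x00-\x7F]                        # ASCII
        | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
        | \xE0[\xA0-\xBF][\x80-\xBF]         # 3-byte, excluding overlongs
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
        | \xED[\x80-\x9F][\x80-\xBF]         # 3-byte, excluding surrogates
        | \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
        | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
        | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
    )*+\z%xs';

    // preg_match() returns 1 on match, 0 on no match, false on error;
    // anything other than 1 is treated as "not valid" here.
    return preg_match($pattern, $bytes) === 1;
}
```

Because a PCRE error also makes this return false, very large inputs that hit PCRE limits could be reported as invalid; for those, a built-in check such as mb_check_encoding() is probably the safer choice.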
utf8_encode() is a built-in PHP function that converts an ISO-8859-1 string to UTF-8. It is deprecated as of PHP 8.2.
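Since utf8_encode() is deprecated as of PHP 8.2, the same ISO-8859-1 to UTF-8 conversion can be sketched with mb_convert_encoding(). This assumes the mbstring extension is loaded and that the input really is ISO-8859-1; the function name latin1ToUtf8 is hypothetical.

```php
<?php
// Hypothetical wrapper: convert bytes that are KNOWN to be ISO-8859-1 to UTF-8.
// If the input is already UTF-8, this will double-encode it, so the caller
// must know the source encoding in advance.
function latin1ToUtf8(string $latin1): string
{
    return mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');
}

$converted = latin1ToUtf8("caf\xE9");   // "café" as ISO-8859-1 bytes
var_dump($converted === "caf\xC3\xA9"); // bool(true): 0xE9 became 0xC3 0xA9
```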
There is absolutely no difference in this case; UTF-8 is identical to ASCII in this character range. If storage is an important consideration, maybe look into compression. A simple Huffman compression will use something like 3 bits per byte for this kind of data.
UTF-8 Characters in Web Development
UTF-8 is the most common character encoding method used on the internet today, and is the default character set for HTML5. Over 95% of all websites, likely including your own, store characters this way.
If you do not know the encoding of a string, you cannot guess it reliably, because the same bytes are valid in many encodings. That is why mb_detect_encoding gave you a wrong answer: it can only guess. If, however, you know what encoding a string should be in, you can check whether it is a valid string in that encoding using mb_check_encoding. It does more or less what your regex does, probably a little more comprehensively. It can answer the question "Is this sequence of bytes valid in UTF-8?" with a clear yes or no. A yes does not necessarily mean the string actually is in that encoding, just that it may be. For example, it is impossible to distinguish one single-byte encoding that uses all 8 bits from another, because nearly any byte sequence is valid in both. UTF-8 is rather distinguishable, though you can produce, for instance, Latin-1 encoded strings that also happen to be valid UTF-8 byte sequences.
In short, there's no way to know for sure. If you expect UTF-8, check if the byte sequence you received is valid in UTF-8, then you can treat the string safely as UTF-8. Beyond that there's hardly anything you can do.
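As a rough illustration of that advice, the following sketch validates input with mb_check_encoding() and shows the ambiguity mentioned above. It assumes the mbstring extension is available; the helper name requireUtf8 is hypothetical.

```php
<?php
// Hypothetical helper: accept input only if it is well-formed UTF-8.
function requireUtf8(string $bytes): string
{
    if (!mb_check_encoding($bytes, 'UTF-8')) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    return $bytes;
}

$latin1    = "caf\xE9";      // "café" in ISO 8859-1: a lone 0xE9 is invalid UTF-8
$utf8      = "caf\xC3\xA9";  // "café" in UTF-8
$ambiguous = "\xC3\xA9";     // "é" in UTF-8, but also "Ã©" in ISO 8859-1

var_dump(mb_check_encoding($latin1, 'UTF-8'));    // bool(false)
var_dump(mb_check_encoding($utf8, 'UTF-8'));      // bool(true)
var_dump(mb_check_encoding($ambiguous, 'UTF-8')); // bool(true), yet it could be Latin-1
```

The last case is why a successful check only tells you the bytes may be UTF-8, not that the author intended them as UTF-8.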