
Get file encoding [duplicate]

Possible Duplicate:
Detect file encoding in PHP

How can I figure out with PHP what file encoding a file has?

powtac asked Dec 13 '22 03:12



1 Answer

Detecting the encoding is really hard for every 8-bit character set except UTF-8 (which can be recognized because not every byte sequence is valid UTF-8), and it usually requires semantic knowledge of the text whose encoding is to be detected.

Think of it: any piece of plain text is just a bunch of bytes with no encoding information attached. Any particular byte could mean anything, so to have a chance at detecting the encoding you have to look at that byte in the context of the other bytes and try heuristics based on the languages the text is likely to be written in.

For 8-bit character sets, though, you can never be sure.
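To make the ambiguity concrete, here is a minimal sketch that decodes the same byte under two different 8-bit charsets. It assumes the iconv extension is available; `decodeAs` is a hypothetical helper name, not something from PHP itself.

```php
<?php
// Hypothetical helper: decode the same raw bytes under several candidate charsets.
function decodeAs(string $bytes, array $charsets): array
{
    $results = [];
    foreach ($charsets as $charset) {
        // iconv() converts from the assumed source charset to UTF-8.
        $results[$charset] = iconv($charset, 'UTF-8', $bytes);
    }
    return $results;
}

// The single byte 0xE4 is a legal character in both charsets.
$decoded = decodeAs("\xE4", ['ISO-8859-1', 'ISO-8859-5']);
```

Read as ISO-8859-1 the byte is "ä", read as ISO-8859-5 it is the Cyrillic "ф". Both results are valid, so nothing in the byte itself tells you which one was meant.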

For a demonstration of such heuristics going wrong, see for example:

http://www.hoax-slayer.com/bush-hid-the-facts-notepad.html

With some 16-bit sets you do have a chance at detection, because they might include a byte order mark or have every second byte set to 0.
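Both hints can be checked with plain string functions. The following is only a sketch: `detectBom` and `guessUtf16ByNulls` are hypothetical names, and the 80% / 10% thresholds in the second function are arbitrary assumptions that only make sense for mostly-Latin text.

```php
<?php
// Hypothetical helper: return the encoding implied by a leading BOM, or null.
function detectBom(string $bytes): ?string
{
    // UTF-32 marks come first: the UTF-32LE BOM begins with the UTF-16LE BOM.
    $boms = [
        'UTF-32BE' => "\x00\x00\xFE\xFF",
        'UTF-32LE' => "\xFF\xFE\x00\x00",
        'UTF-8'    => "\xEF\xBB\xBF",
        'UTF-16BE' => "\xFE\xFF",
        'UTF-16LE' => "\xFF\xFE",
    ];
    foreach ($boms as $encoding => $bom) {
        if (strncmp($bytes, $bom, strlen($bom)) === 0) {
            return $encoding;
        }
    }
    return null;
}

// Hypothetical helper: guess UTF-16 endianness from where the zero bytes fall.
function guessUtf16ByNulls(string $bytes): ?string
{
    $len = strlen($bytes);
    $pairs = intdiv($len, 2);
    if ($pairs === 0) {
        return null;
    }
    $even = 0;
    $odd = 0;
    for ($i = 0; $i < $len; $i++) {
        if ($bytes[$i] === "\x00") {
            if ($i % 2 === 0) {
                $even++;
            } else {
                $odd++;
            }
        }
    }
    // Assumed thresholds: nearly every pair has a zero on one side only.
    if ($odd >= 0.8 * $pairs && $even <= 0.1 * $pairs) {
        return 'UTF-16LE'; // low byte first, zero high byte second
    }
    if ($even >= 0.8 * $pairs && $odd <= 0.1 * $pairs) {
        return 'UTF-16BE'; // zero high byte first
    }
    return null;
}
```

A null result from either function means "no evidence", not "not UTF-16": files are often written without a BOM, and non-Latin UTF-16 text has few zero bytes.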

If you just want to detect UTF-8, you can either use mb_detect_encoding or this handy little function:

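/**
 * Returns true if $string contains at least one well-formed multi-byte
 * UTF-8 sequence. The pattern is unanchored and has no ASCII branch, so
 * pure ASCII input yields false and stray invalid bytes are not rejected.
 */
function isUTF8($string){
    return preg_match('%(?:
    [\xC2-\xDF][\x80-\xBF]                # non-overlong 2-byte
    |\xE0[\xA0-\xBF][\x80-\xBF]           # excluding overlongs
    |[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}    # straight 3-byte
    |\xED[\x80-\x9F][\x80-\xBF]           # excluding surrogates
    |\xF0[\x90-\xBF][\x80-\xBF]{2}        # planes 1-3
    |[\xF1-\xF3][\x80-\xBF]{3}            # planes 4-15
    |\xF4[\x80-\x8F][\x80-\xBF]{2}        # plane 16
    )+%xs', $string) === 1;
}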
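Note that the pattern in isUTF8 is not anchored, so it reports whether the string contains a UTF-8 multi-byte sequence rather than whether the whole string is valid. If you need the stricter check, one possible sketch relies on PCRE's `u` modifier, which validates the subject before matching. The name `isValidUtf8` is hypothetical.

```php
<?php
// Hypothetical helper: true only if every byte of $string is part of valid UTF-8.
function isValidUtf8(string $string): bool
{
    // With the /u modifier PCRE checks the subject as UTF-8 first;
    // preg_match() returns false (not 0) when the subject is malformed,
    // while the empty pattern matches any well-formed subject.
    return preg_match('//u', $string) === 1;
}
```

With this version plain ASCII counts as valid UTF-8, and a string that mixes a valid sequence with a stray Latin-1 byte such as "\xE4" is rejected. If the mbstring extension is installed, mb_check_encoding($string, 'UTF-8') should serve the same purpose.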
pilif answered Dec 28 '22 10:12
