Is testing for UTF-8 strings in PHP a reliable method?

Tags:

string

php

utf-8

I've found a useful function on another answer and I wonder if someone could explain to me what it is doing and if it is reliable. I was using mb_detect_encoding(), but it was incorrect when reading from an ISO 8859-1 file on a Linux OS.

This function seems to work in all cases I tested.

Here is the question: Get file encoding

Here is the function:

function isUTF8($string){
    return preg_match('%(?:
    [\xC2-\xDF][\x80-\xBF]              # Non-overlong 2-byte
    |\xE0[\xA0-\xBF][\x80-\xBF]         # Excluding overlongs
    |[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # Straight 3-byte
    |\xED[\x80-\x9F][\x80-\xBF]         # Excluding surrogates
    |\xF0[\x90-\xBF][\x80-\xBF]{2}      # Planes 1-3
    |[\xF1-\xF3][\x80-\xBF]{3}          # Planes 4-15
    |\xF4[\x80-\x8F][\x80-\xBF]{2}      # Plane 16
    )+%xs', $string);
}

Is this a reliable way of detecting UTF-8 strings? What exactly is it doing? Can it be made more robust?

asked Mar 14 '12 23:03

Gary Willoughby

1 Answer

If you do not know the encoding of a string, you cannot reliably guess it. That's why mb_detect_encoding cannot be relied on. If, however, you know which encoding a string should be in, you can check whether it is valid in that encoding using mb_check_encoding. It does more or less what your regex does, probably a little more comprehensively. It answers the question "Is this sequence of bytes valid in UTF-8?" with a clear yes or no. That doesn't necessarily mean the string actually is in that encoding, just that it may be. For example, it is impossible to distinguish one single-byte encoding that uses all 8 bits from another single-byte encoding that uses all 8 bits, since the same bytes can be valid in both. UTF-8 is rather distinguishable, because its multi-byte sequences have to follow strict rules, though you can still produce, for instance, Latin-1 encoded strings that also happen to be valid UTF-8 byte sequences.
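To make the comparison concrete, here is a minimal sketch that runs the regex from the question and mb_check_encoding over the same inputs. It assumes the mbstring extension is loaded, and the sample byte strings are made up for illustration. One thing it brings out: the regex is not anchored, so it reports a match as soon as one valid multi-byte sequence appears anywhere in the string, whereas mb_check_encoding judges the whole string.

```php
<?php
// The function from the question, unchanged.
function isUTF8($string){
    return preg_match('%(?:
    [\xC2-\xDF][\x80-\xBF]              # Non-overlong 2-byte
    |\xE0[\xA0-\xBF][\x80-\xBF]         # Excluding overlongs
    |[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # Straight 3-byte
    |\xED[\x80-\x9F][\x80-\xBF]         # Excluding surrogates
    |\xF0[\x90-\xBF][\x80-\xBF]{2}      # Planes 1-3
    |[\xF1-\xF3][\x80-\xBF]{3}          # Planes 4-15
    |\xF4[\x80-\x8F][\x80-\xBF]{2}      # Plane 16
    )+%xs', $string);
}

// Illustrative inputs (assumptions, not taken from the question).
$samples = [
    'ascii'  => "abc",           // plain ASCII, which is also valid UTF-8
    'utf8'   => "caf\xC3\xA9",   // "café" encoded as UTF-8
    'latin1' => "caf\xE9",       // "café" encoded as ISO 8859-1
    'mixed'  => "\xC3\xA9\xFF",  // valid 2-byte sequence followed by a stray byte
];

foreach ($samples as $name => $bytes) {
    printf(
        "%-7s regex=%d mb_check_encoding=%s\n",
        $name,
        isUTF8($bytes),
        var_export(mb_check_encoding($bytes, 'UTF-8'), true)
    );
}
```

The regex returns 0 for the pure ASCII sample and 1 for the "mixed" sample, while mb_check_encoding accepts the former and rejects the latter. Note also that the bytes \xC3\xA9 are perfectly legal ISO 8859-1 (they read as "Ã©" there), which is exactly the kind of ambiguity no validity check can resolve.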

In short, there's no way to know for sure. If you expect UTF-8, check if the byte sequence you received is valid in UTF-8, then you can treat the string safely as UTF-8. Beyond that there's hardly anything you can do.
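As a rough sketch of that "validate, then trust" approach: the helper name requireUtf8 below is hypothetical, and the code assumes the mbstring extension is available. It only confirms that the bytes are valid UTF-8; it cannot prove that UTF-8 is what the sender intended.

```php
<?php
/**
 * Hypothetical helper: return the input unchanged if it is a valid
 * UTF-8 byte sequence, otherwise throw.
 */
function requireUtf8(string $bytes): string
{
    if (!mb_check_encoding($bytes, 'UTF-8')) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    return $bytes;
}

// "naïve" encoded as UTF-8: 6 bytes, 5 characters.
$text = requireUtf8("na\xC3\xAFve");
echo mb_strlen($text, 'UTF-8'), "\n"; // prints 5

try {
    // "naïve" encoded as ISO 8859-1: the lone \xEF byte is not valid UTF-8.
    requireUtf8("na\xEFve");
} catch (InvalidArgumentException $e) {
    echo $e->getMessage(), "\n";
}
```

Throwing at the boundary keeps the rest of the code free of encoding checks; whether to reject, convert, or log invalid input instead is a design choice that depends on where the data comes from.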

answered Oct 12 '22 11:10

deceze
