UTF-8 validation in PHP without using preg_match()

Tags: regex, validation, php, utf-8

I need to validate some user input that is encoded in UTF-8. Many have recommended using the following code:

preg_match('/\A(
     [\x09\x0A\x0D\x20-\x7E]
   | [\xC2-\xDF][\x80-\xBF]
   |  \xE0[\xA0-\xBF][\x80-\xBF]
   | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}
   |  \xED[\x80-\x9F][\x80-\xBF]
   |  \xF0[\x90-\xBF][\x80-\xBF]{2}
   | [\xF1-\xF3][\x80-\xBF]{3}
   |  \xF4[\x80-\x8F][\x80-\xBF]{2}
  )*\z/x', $string);

It's a regular expression taken from http://www.w3.org/International/questions/qa-forms-utf-8 . Everything was OK until I discovered a bug in PHP that seems to have been around since at least 2006: preg_match() causes a segfault if $string is too long. There doesn't seem to be any workaround. You can view the bug submission here: http://bugs.php.net/bug.php?id=36463

Now, to avoid using preg_match I've created a function that does the exact same thing as the regular expression above. I don't know if this question is appropriate here at Stack Overflow, but I would like to know if the function I've made is correct. Here it is:

EDIT [13.01.2010]: If anyone is interested, there were several bugs in the previous version I've posted. Below is the final version of my function.

function check_UTF8_string(&$string) {
    $len = mb_strlen($string, "ISO-8859-1");
    $ok = 1;

    for ($i = 0; $i < $len; $i++) {
        $o = ord(mb_substr($string, $i, 1, "ISO-8859-1"));

        if ($o == 9 || $o == 10 || $o == 13 || ($o >= 32 && $o <= 126)) {

        }
        elseif ($o >= 194 && $o <= 223) {
            $i++;
            $o2 = ord(mb_substr($string, $i, 1, "ISO-8859-1"));
            if (!($o2 >= 128 && $o2 <= 191)) {
                $ok = 0;
                break;
            }
        }
        elseif ($o == 224) {
            $o2 = ord(mb_substr($string, $i + 1, 1, "ISO-8859-1"));
            $o3 = ord(mb_substr($string, $i + 2, 1, "ISO-8859-1"));
            $i += 2;
            if (!($o2 >= 160 && $o2 <= 191) || !($o3 >= 128 && $o3 <= 191)) {
                $ok = 0;
                break;
            }
        }
        elseif (($o >= 225 && $o <= 236) || $o == 238 || $o == 239) {
            $o2 = ord(mb_substr($string, $i + 1, 1, "ISO-8859-1"));
            $o3 = ord(mb_substr($string, $i + 2, 1, "ISO-8859-1"));
            $i += 2;
            if (!($o2 >= 128 && $o2 <= 191) || !($o3 >= 128 && $o3 <= 191)) {
                $ok = 0;
                break;
            }
        }
        elseif ($o == 237) {
            $o2 = ord(mb_substr($string, $i + 1, 1, "ISO-8859-1"));
            $o3 = ord(mb_substr($string, $i + 2, 1, "ISO-8859-1"));
            $i += 2;
            if (!($o2 >= 128 && $o2 <= 159) || !($o3 >= 128 && $o3 <= 191)) {
                $ok = 0;
                break;
            }
        }
        elseif ($o == 240) {
            $o2 = ord(mb_substr($string, $i + 1, 1, "ISO-8859-1"));
            $o3 = ord(mb_substr($string, $i + 2, 1, "ISO-8859-1"));
            $o4 = ord(mb_substr($string, $i + 3, 1, "ISO-8859-1"));
            $i += 3;
            if (!($o2 >= 144 && $o2 <= 191) ||
                !($o3 >= 128 && $o3 <= 191) ||
                !($o4 >= 128 && $o4 <= 191)) {
                $ok = 0;
                break;
            }
        }
        elseif ($o >= 241 && $o <= 243) {
            $o2 = ord(mb_substr($string, $i + 1, 1, "ISO-8859-1"));
            $o3 = ord(mb_substr($string, $i + 2, 1, "ISO-8859-1"));
            $o4 = ord(mb_substr($string, $i + 3, 1, "ISO-8859-1"));
            $i += 3;
            if (!($o2 >= 128 && $o2 <= 191) ||
                !($o3 >= 128 && $o3 <= 191) ||
                !($o4 >= 128 && $o4 <= 191)) {
                $ok = 0;
                break;
            }
        }
        elseif ($o == 244) {
            $o2 = ord(mb_substr($string, $i + 1, 1, "ISO-8859-1"));
            $o3 = ord(mb_substr($string, $i + 2, 1, "ISO-8859-1"));
            $o4 = ord(mb_substr($string, $i + 3, 1, "ISO-8859-1"));
            $i += 3;
            if (!($o2 >= 128 && $o2 <= 143) ||
                !($o3 >= 128 && $o3 <= 191) ||
                !($o4 >= 128 && $o4 <= 191)) {
                $ok = 0;
                break;
            }
        }
        else {
            $ok = 0;
            break;
        }
    }

    return $ok;
}

Yes, it's very long. I hope I've understood correctly how that regular expression works. Also hope it will be of help to others.

Thanks in advance!
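
For comparison, here is a minimal sketch of the same byte ranges written with strlen() and direct string offsets instead of mb_substr(). The name check_utf8_bytes() is hypothetical, and the sketch assumes PHP 7 or later with no mbstring function overloading, so that strlen() counts bytes. It is meant as an illustration of the technique, not a drop-in replacement.

```php
<?php
// Hypothetical compact variant of the byte-wise check (assumes PHP 7+).
function check_utf8_bytes(string $s): bool
{
    $len = strlen($s);
    for ($i = 0; $i < $len; ) {
        $o = ord($s[$i]);

        // Tab, LF, CR and printable ASCII, as in the W3C pattern.
        if ($o === 9 || $o === 10 || $o === 13 || ($o >= 32 && $o <= 126)) {
            $i++;
            continue;
        }

        // For each lead byte: allowed range of the second byte ($lo..$hi)
        // and total sequence length in bytes ($n).
        if ($o >= 0xC2 && $o <= 0xDF)     { $lo = 0x80; $hi = 0xBF; $n = 2; }
        elseif ($o === 0xE0)              { $lo = 0xA0; $hi = 0xBF; $n = 3; }
        elseif ($o === 0xED)              { $lo = 0x80; $hi = 0x9F; $n = 3; }
        elseif ($o >= 0xE1 && $o <= 0xEF) { $lo = 0x80; $hi = 0xBF; $n = 3; }
        elseif ($o === 0xF0)              { $lo = 0x90; $hi = 0xBF; $n = 4; }
        elseif ($o >= 0xF1 && $o <= 0xF3) { $lo = 0x80; $hi = 0xBF; $n = 4; }
        elseif ($o === 0xF4)              { $lo = 0x80; $hi = 0x8F; $n = 4; }
        else { return false; }

        // Reject sequences that run past the end of the string.
        if ($i + $n > $len) {
            return false;
        }

        // The second byte has a lead-specific range.
        $b = ord($s[$i + 1]);
        if ($b < $lo || $b > $hi) {
            return false;
        }

        // Any remaining bytes are plain continuation bytes.
        for ($k = 2; $k < $n; $k++) {
            $b = ord($s[$i + $k]);
            if ($b < 0x80 || $b > 0xBF) {
                return false;
            }
        }
        $i += $n;
    }
    return true;
}
```

Because the loop advances by the full sequence length in one place, an off-by-N increment in a single branch cannot silently skip unvalidated bytes.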


asked Aug 15 '09 22:08

liviucmg

3 Answers

You can always use the Multibyte String Functions:

If you want to use it a lot and possibly change it at some point:

1) First set the encoding you want to use in your config file

/* Set internal character encoding to UTF-8 */
mb_internal_encoding("UTF-8");

2) Check the String

if(mb_check_encoding($string))
{
    // do something
}

Or, if you don't plan on changing it, you can always just put the encoding straight into the function:

if(mb_check_encoding($string, 'UTF-8'))
{
    // do something
}
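
As a rough illustration of both forms, the sketch below feeds one valid and one invalid byte string to mb_check_encoding(). The sample strings are my own, and the snippet assumes the mbstring extension is loaded on a reasonably recent PHP version.

```php
<?php
// Assumes the mbstring extension is available.
mb_internal_encoding('UTF-8');

$valid   = "caf\xC3\xA9"; // "café" encoded as UTF-8
$invalid = "caf\xE9";     // "café" encoded as ISO-8859-1, not valid UTF-8

// Without a second argument, the internal encoding set above is used.
var_dump(mb_check_encoding($valid));            // bool(true)
var_dump(mb_check_encoding($invalid));          // bool(false)

// Naming the encoding explicitly gives the same answers.
var_dump(mb_check_encoding($valid, 'UTF-8'));   // bool(true)
var_dump(mb_check_encoding($invalid, 'UTF-8')); // bool(false)
```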

answered Oct 14 '22 04:10

Tyler Carter

Given that there is still no explicit isUtf8() function in PHP, here's how UTF-8 can be accurately validated in PHP depending on your PHP version.

The easiest and most backwards-compatible way to properly validate UTF-8 is still a regular expression, using a function such as:

function isValid($string)
{
    return preg_match(
        '/\A(?>
            [\x00-\x7F]+                       # ASCII
          | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
          |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
          | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
          |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
          |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
          | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
          |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
        )*\z/x',
        $string
    ) === 1;
}

Note the two key differences from the regular expression offered by the W3C: it uses a once-only (atomic) subpattern, and it has a '+' quantifier after the first character class. (It also accepts every ASCII byte, \x00-\x7F, where the W3C pattern admits only tab, LF, CR and printable characters.) PCRE can still crash, but most of the problem is caused by repeating a capturing subpattern. Making the group once-only and matching runs of single-byte characters in a single step should keep PCRE from quickly running out of stack (and segfaulting). Unless you're validating strings with thousands of multibyte characters, this regular expression should serve you well.
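
One caveat with the `=== 1` comparison is that a PCRE failure (preg_match() returning false) looks the same as an invalid string. A minimal sketch of how the two cases could be told apart is shown below; the function name isValidOrNull() and the choice of null for "could not decide" are my own assumptions, not part of the answer.

```php
<?php
// Hypothetical wrapper: true = valid, false = invalid, null = PCRE gave up.
function isValidOrNull($string)
{
    $result = preg_match(
        '/\A(?>
            [\x00-\x7F]+
          | [\xC2-\xDF][\x80-\xBF]
          |  \xE0[\xA0-\xBF][\x80-\xBF]
          | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}
          |  \xED[\x80-\x9F][\x80-\xBF]
          |  \xF0[\x90-\xBF][\x80-\xBF]{2}
          | [\xF1-\xF3][\x80-\xBF]{3}
          |  \xF4[\x80-\x8F][\x80-\xBF]{2}
        )*\z/x',
        $string
    );

    if ($result === false) {
        // preg_last_error() tells which limit was hit, e.g.
        // PREG_BACKTRACK_LIMIT_ERROR or PREG_RECURSION_LIMIT_ERROR.
        return null;
    }

    return $result === 1;
}
```

A caller can then fall back to another validation method when null comes back instead of rejecting input that may be fine.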

Another good alternative is using mb_check_encoding() if you have the mbstring extension available. Validating UTF-8 can be done as simply as:

function isValid($string)
{
    return mb_check_encoding($string, 'UTF-8') === true;
}

Note, however, that if you're using a PHP version prior to 5.4.0, this function has some flaws in its validation:

Prior to 5.4.0, the function accepts code points beyond the allowed Unicode range. This means it also allows 5- and 6-byte UTF-8 characters.
Prior to 5.3.0, the function accepts surrogate code points as valid UTF-8 characters.
Prior to 5.2.5, the function is completely unusable because it does not work as intended.
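
To see whether a particular PHP build is affected by flaws of this kind, one option is to run a validator against a few malformed inputs. The sketch below is only an illustration: the helper name wronglyAccepted() and the choice of test vectors are mine, and it assumes the mbstring extension is loaded.

```php
<?php
// Malformed inputs that a strict UTF-8 validator should reject.
$vectors = array(
    'surrogate U+D800'        => "\xED\xA0\x80",
    'code point above 10FFFF' => "\xF4\x90\x80\x80",
    '5-byte sequence'         => "\xF8\x88\x80\x80\x80",
    'overlong encoding of /'  => "\xC0\xAF",
);

// Hypothetical helper: returns the labels of the vectors that
// $validator wrongly accepts as valid UTF-8.
function wronglyAccepted($validator, array $vectors)
{
    $accepted = array();
    foreach ($vectors as $label => $bytes) {
        if (call_user_func($validator, $bytes)) {
            $accepted[] = $label;
        }
    }
    return $accepted;
}

$check = function ($s) {
    return mb_check_encoding($s, 'UTF-8');
};

// An empty array means every malformed input was rejected.
print_r(wronglyAccepted($check, $vectors));
```

The same helper can be pointed at any of the other validators discussed here by passing a different callable.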

As the internet also lists numerous other ways to validate UTF-8, I will discuss some of them here. Note that the following should be avoided in most cases.

mb_detect_encoding() is sometimes used to validate UTF-8. If you have at least PHP version 5.4.0, it does actually work with the strict parameter:

function isValid($string)
{
    return mb_detect_encoding($string, 'UTF-8', true) === 'UTF-8';
}

It is very important to understand that this does not work prior to 5.4.0. It's very flawed prior to that version, since it only checks for invalid sequences but allows overlong sequences and invalid code points. In addition, you should never use it for this purpose without the strict parameter set to true (it does not actually do validation without the strict parameter).

One nifty way to validate UTF-8 is the 'u' flag in PCRE. Though this is poorly documented, the flag also makes PCRE validate the subject string. An example could be:

function isValid($string)
{
    return preg_match('//u', $string) === 1;
}

Every string should match an empty pattern, but with the 'u' flag the match only succeeds on valid UTF-8 strings. However, unless you're using at least PHP 5.5.10, the validation is flawed as follows:

Prior to 5.5.10, it does not recognize 3- and 4-byte sequences as valid UTF-8. As this excludes most Unicode code points, it is a pretty major flaw.
Prior to 5.2.5, it also allows surrogates and code points beyond the allowed Unicode space (e.g. 5- and 6-byte characters).

Using the 'u' flag behavior does have one advantage though: It's the fastest of the discussed methods. If you need speed and you're running the latest and greatest PHP version, this validation method might be for you.
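
When the 'u' flag rejects a subject, preg_match() returns false rather than 0, and preg_last_error() reports PREG_BAD_UTF8_ERROR. A small sketch of how that could be used to separate "invalid UTF-8" from other PCRE failures follows; the function name utf8Status() and its return strings are hypothetical, and a current PHP version is assumed.

```php
<?php
// Hypothetical helper that reports why the 'u' flag match failed.
function utf8Status($string)
{
    if (preg_match('//u', $string) === 1) {
        return 'valid';
    }

    // false from preg_match() plus this error code means the subject
    // itself was rejected as malformed UTF-8.
    return preg_last_error() === PREG_BAD_UTF8_ERROR
        ? 'invalid UTF-8'
        : 'PCRE error';
}

echo utf8Status("caf\xC3\xA9"), "\n"; // valid
echo utf8Status("caf\xE9"), "\n";     // invalid UTF-8
```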

One additional way to validate UTF-8 is via json_encode(), which expects input strings to be in UTF-8. It does not work prior to 5.5.0, but from that version on, json_encode() returns false for invalid sequences instead of a string. For example:

function isValid($string)
{
    return json_encode($string) !== false;
}

I would not recommend relying on this behavior to last, however. Previous PHP versions simply produced an error on invalid sequences, so there is no guarantee that the current behavior is final.
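
If this approach is used anyway, checking json_last_error() makes the intent a little more explicit, since JSON_ERROR_UTF8 is the code reported for malformed UTF-8. The sketch below assumes PHP 5.5 or later with the json extension; the function name jsonUtf8Check() and the null return for unrelated failures are my own choices.

```php
<?php
// Hypothetical helper: true = valid, false = malformed UTF-8,
// null = json_encode() failed for some other reason.
function jsonUtf8Check($string)
{
    if (json_encode($string) !== false) {
        return true;
    }

    return json_last_error() === JSON_ERROR_UTF8 ? false : null;
}

var_dump(jsonUtf8Check("caf\xC3\xA9")); // bool(true)
var_dump(jsonUtf8Check("caf\xE9"));     // bool(false)
```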

answered Oct 14 '22 04:10

Riimu

You should be able to use iconv to check for validity. Just try to convert the string to UTF-16 and see if you get an error.
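
A minimal sketch of that idea, assuming the iconv extension is loaded, could look like the following. The function name isValidViaIconv() is hypothetical, and how strictly edge cases such as overlong forms or surrogates are rejected depends on the underlying iconv implementation, so treat the result as a first filter rather than a guarantee.

```php
<?php
// Hypothetical helper based on the iconv suggestion.
function isValidViaIconv($string)
{
    // iconv() returns false (and raises a notice) when it meets an
    // illegal or incomplete input sequence; @ silences the notice.
    return @iconv('UTF-8', 'UTF-16', $string) !== false;
}

var_dump(isValidViaIconv("caf\xC3\xA9")); // bool(true)
var_dump(isValidViaIconv("\xFFabc"));     // bool(false)
```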

answered Oct 14 '22 05:10

derobert
