UTF-8 validation in PHP without using preg_match()

I need to validate some user input that is encoded in UTF-8. Many have recommended using the following code:

preg_match('/\A(
     [\x09\x0A\x0D\x20-\x7E]
   | [\xC2-\xDF][\x80-\xBF]
   |  \xE0[\xA0-\xBF][\x80-\xBF]
   | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}
   |  \xED[\x80-\x9F][\x80-\xBF]
   |  \xF0[\x90-\xBF][\x80-\xBF]{2}
   | [\xF1-\xF3][\x80-\xBF]{3}
   |  \xF4[\x80-\x8F][\x80-\xBF]{2}
  )*\z/x', $string);

It's a regular expression taken from http://www.w3.org/International/questions/qa-forms-utf-8 . Everything was fine until I discovered a bug in PHP that seems to have been around since at least 2006: preg_match() causes a segfault if $string is too long, and there doesn't seem to be any workaround. You can view the bug report here: http://bugs.php.net/bug.php?id=36463

Now, to avoid using preg_match I've created a function that does the exact same thing as the regular expression above. I don't know if this question is appropriate here at Stack Overflow, but I would like to know if the function I've made is correct. Here it is:

EDIT [13.01.2010]: If anyone is interested, there were several bugs in the previous version I've posted. Below is the final version of my function.

function check_UTF8_string(&$string) {
    $len = mb_strlen($string, "ISO-8859-1");
    $ok = 1;

    for ($i = 0; $i < $len; $i++) {
        $o = ord(mb_substr($string, $i, 1, "ISO-8859-1"));

        if ($o == 9 || $o == 10 || $o == 13 || ($o >= 32 && $o <= 126)) {

        }
        elseif ($o >= 194 && $o <= 223) {
            $i++;
            $o2 = ord(mb_substr($string, $i, 1, "ISO-8859-1"));
            if (!($o2 >= 128 && $o2 <= 191)) {
                $ok = 0;
                break;
            }
        }
        elseif ($o == 224) {
            $o2 = ord(mb_substr($string, $i + 1, 1, "ISO-8859-1"));
            $o3 = ord(mb_substr($string, $i + 2, 1, "ISO-8859-1"));
            $i += 2;
            if (!($o2 >= 160 && $o2 <= 191) || !($o3 >= 128 && $o3 <= 191)) {
                $ok = 0;
                break;
            }
        }
        elseif (($o >= 225 && $o <= 236) || $o == 238 || $o == 239) {
            $o2 = ord(mb_substr($string, $i + 1, 1, "ISO-8859-1"));
            $o3 = ord(mb_substr($string, $i + 2, 1, "ISO-8859-1"));
            $i += 2;
            if (!($o2 >= 128 && $o2 <= 191) || !($o3 >= 128 && $o3 <= 191)) {
                $ok = 0;
                break;
            }
        }
        elseif ($o == 237) {
            $o2 = ord(mb_substr($string, $i + 1, 1, "ISO-8859-1"));
            $o3 = ord(mb_substr($string, $i + 2, 1, "ISO-8859-1"));
            $i += 2;
            if (!($o2 >= 128 && $o2 <= 159) || !($o3 >= 128 && $o3 <= 191)) {
                $ok = 0;
                break;
            }
        }
        elseif ($o == 240) {
            $o2 = ord(mb_substr($string, $i + 1, 1, "ISO-8859-1"));
            $o3 = ord(mb_substr($string, $i + 2, 1, "ISO-8859-1"));
            $o4 = ord(mb_substr($string, $i + 3, 1, "ISO-8859-1"));
            $i += 3;
            if (!($o2 >= 144 && $o2 <= 191) ||
                !($o3 >= 128 && $o3 <= 191) ||
                !($o4 >= 128 && $o4 <= 191)) {
                $ok = 0;
                break;
            }
        }
        elseif ($o >= 241 && $o <= 243) {
            $o2 = ord(mb_substr($string, $i + 1, 1, "ISO-8859-1"));
            $o3 = ord(mb_substr($string, $i + 2, 1, "ISO-8859-1"));
            $o4 = ord(mb_substr($string, $i + 3, 1, "ISO-8859-1"));
            $i += 3;
            if (!($o2 >= 128 && $o2 <= 191) ||
                !($o3 >= 128 && $o3 <= 191) ||
                !($o4 >= 128 && $o4 <= 191)) {
                $ok = 0;
                break;
            }
        }
        elseif ($o == 244) {
            $o2 = ord(mb_substr($string, $i + 1, 1, "ISO-8859-1"));
            $o3 = ord(mb_substr($string, $i + 2, 1, "ISO-8859-1"));
            $o4 = ord(mb_substr($string, $i + 3, 1, "ISO-8859-1"));
            $i += 3; // a \xF4 sequence is 4 bytes long, so skip its 3 continuation bytes
            if (!($o2 >= 128 && $o2 <= 143) ||
                !($o3 >= 128 && $o3 <= 191) ||
                !($o4 >= 128 && $o4 <= 191)) {
                $ok = 0;
                break;
            }
        }
        else {
            $ok = 0;
            break;
        }
    }

    return $ok;
}

Yes, it's very long. I hope I've understood correctly how that regular expression works. Also hope it will be of help to others.

Thanks in advance!
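
For comparison, the same byte ranges can be checked with plain string offsets instead of mb_substr(), which shortens the function considerably. This is only a sketch, assuming PHP 7.1 or later; the name check_utf8_bytes() and the rule-table layout are made up for illustration and are not part of the original function.

```php
<?php
// Hypothetical compact variant of check_UTF8_string(); assumes PHP 7.1+.
function check_utf8_bytes(string $s): bool
{
    $len = strlen($s); // byte length (assumes mbstring function overloading is off)

    for ($i = 0; $i < $len; $i++) {
        $o = ord($s[$i]);

        // Single bytes allowed by the W3C pattern: tab, LF, CR, printable ASCII.
        if ($o === 0x09 || $o === 0x0A || $o === 0x0D || ($o >= 0x20 && $o <= 0x7E)) {
            continue;
        }

        // Per lead byte: [continuation byte count, min second byte, max second byte].
        if ($o >= 0xC2 && $o <= 0xDF) {
            $rule = [1, 0x80, 0xBF];
        } elseif ($o === 0xE0) {
            $rule = [2, 0xA0, 0xBF]; // excludes overlong 3-byte forms
        } elseif ($o === 0xED) {
            $rule = [2, 0x80, 0x9F]; // excludes surrogates
        } elseif ($o >= 0xE1 && $o <= 0xEF) {
            $rule = [2, 0x80, 0xBF];
        } elseif ($o === 0xF0) {
            $rule = [3, 0x90, 0xBF]; // excludes overlong 4-byte forms
        } elseif ($o >= 0xF1 && $o <= 0xF3) {
            $rule = [3, 0x80, 0xBF];
        } elseif ($o === 0xF4) {
            $rule = [3, 0x80, 0x8F]; // nothing above U+10FFFF
        } else {
            return false; // stray continuation byte or invalid lead byte
        }

        [$count, $min, $max] = $rule;
        if ($i + $count >= $len) {
            return false; // sequence is cut off by the end of the string
        }

        $second = ord($s[$i + 1]);
        if ($second < $min || $second > $max) {
            return false;
        }
        for ($k = 2; $k <= $count; $k++) {
            $b = ord($s[$i + $k]);
            if ($b < 0x80 || $b > 0xBF) {
                return false;
            }
        }

        $i += $count; // the loop's $i++ then moves on to the next lead byte
    }

    return true;
}
```

Like the regular expression it mirrors, this rejects control characters other than tab, LF and CR, so a string containing "\x00" is reported as invalid.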

asked Aug 15 '09 22:08 by liviucmg



3 Answers

You can always use the Multibyte String Functions:

If you check encodings in many places and might change the encoding at some point:

1) First set the encoding you want to use in your config file

/* Set internal character encoding to UTF-8 */
mb_internal_encoding("UTF-8");

2) Check the String

if(mb_check_encoding($string))
{
    // do something
}

Or, if you don't plan on changing it, you can always just put the encoding straight into the function:

if(mb_check_encoding($string, 'UTF-8'))
{
    // do something
}
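
To see what the check actually accepts, it can be run against a few hand-written byte strings. This is a small sketch assuming the mbstring extension is loaded and a reasonably recent PHP version; the sample labels are illustrative only.

```php
<?php
// Sample byte strings; the labels are illustrative.
$samples = [
    'plain ASCII'        => 'hello',
    'two-byte character' => "h\xC3\xA9llo", // "héllo"
    'broken sequence'    => "\xC3\x28",     // lead byte followed by a non-continuation byte
    'overlong encoding'  => "\xC0\xAF",     // "/" encoded in two bytes
];

foreach ($samples as $label => $bytes) {
    // Passing the encoding explicitly avoids depending on mb_internal_encoding().
    $valid = mb_check_encoding($bytes, 'UTF-8');
    echo $label, ': ', $valid ? 'valid' : 'invalid', "\n";
}
```
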
answered Oct 14 '22 04:10 by Tyler Carter


Given that there is still no explicit isUtf8() function in PHP, here's how UTF-8 can be accurately validated in PHP depending on your PHP version.

The easiest and most backwards-compatible way to properly validate UTF-8 is still a regular expression, using a function such as:

function isValid($string)
{
    return preg_match(
        '/\A(?>
            [\x00-\x7F]+                       # ASCII
          | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
          |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
          | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
          |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
          |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
          | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
          |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
        )*\z/x',
        $string
    ) === 1;
}

Note the two key differences from the regular expression offered by the W3C: it uses a once-only (atomic) subpattern, (?>...), and it has a '+' quantifier after the first character class. The problem of PCRE crashing still persists, but most of it is caused by repeating a capturing subpattern. Turning the group into a once-only subpattern and consuming runs of single-byte characters in a single match should keep PCRE from quickly running out of stack (and causing a segfault). Unless you're validating strings with thousands of multibyte characters, this regular expression should serve you well.
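
Because preg_match() returns false when the engine gives up, comparing the result with 1 reports an engine failure as "invalid UTF-8". The sketch below keeps the two cases apart. The function name utf8_regex_check() and the three-state return value are assumptions made for illustration, and a wrapper like this cannot catch a hard crash of the PHP process.

```php
<?php
// Hypothetical wrapper: true = valid, false = invalid, null = PCRE could not finish.
function utf8_regex_check($string)
{
    $result = preg_match(
        '/\A(?>
            [\x00-\x7F]+                       # ASCII
          | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
          |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
          | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
          |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
          |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
          | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
          |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
        )*\z/x',
        $string
    );

    if ($result === false) {
        // preg_last_error() tells which limit was hit, for example
        // PREG_BACKTRACK_LIMIT_ERROR or PREG_RECURSION_LIMIT_ERROR.
        return null;
    }

    return $result === 1;
}

// A string with several thousand multibyte characters; the outcome depends on
// the PCRE build and its limits, so no particular result is assumed here.
var_dump(utf8_regex_check(str_repeat("\xC3\xA9", 5000)));
```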

Another good alternative is using mb_check_encoding() if you have the mbstring extension available. Validating UTF-8 can be done as simply as:

function isValid($string)
{
    return mb_check_encoding($string, 'UTF-8') === true;
}

Note, however, that if you're using a PHP version prior to 5.4.0, this function has some flaws in its validation:

  • Prior to 5.4.0 the function accepts code points beyond the allowed Unicode range. This means it also allows 5 and 6 byte UTF-8 characters.
  • Prior to 5.3.0 the function accepts surrogate code points as valid UTF-8 characters.
  • Prior to 5.2.5 the function is completely unusable due to not working as intended.
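
One way to act on these version limits is to refuse to answer when the function is missing or too old, rather than return a result that cannot be trusted. A minimal sketch; the helper name utf8_check_mb() and the choice of null as the "cannot tell" value are assumptions, not an established convention.

```php
<?php
// Hypothetical helper: returns null when mb_check_encoding() is missing or too old to trust.
function utf8_check_mb($string)
{
    if (!function_exists('mb_check_encoding')) {
        return null; // mbstring extension not loaded
    }
    if (version_compare(PHP_VERSION, '5.4.0', '<')) {
        return null; // older versions accept out-of-range code points (or worse)
    }

    return mb_check_encoding($string, 'UTF-8');
}
```

A caller can then fall back to another method, such as the regular expression, whenever null comes back.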

As the internet also lists numerous other ways to validate UTF-8, I will discuss some of them here. Note that the following should be avoided in most cases.

Use of mb_detect_encoding() is sometimes seen to validate UTF-8. If you have at least PHP version 5.4.0, it does actually work with the strict parameter via:

function isValid($string)
{
    return mb_detect_encoding($string, 'UTF-8', true) === 'UTF-8';
}

It is very important to understand that this does not work prior to 5.4.0. It's very flawed prior to that version, since it only checks for invalid sequences but allows overlong sequences and invalid code points. In addition, you should never use it for this purpose without the strict parameter set to true (it does not actually do validation without the strict parameter).

One nifty way to validate UTF-8 is the 'u' flag in PCRE. Though this is poorly documented, the flag also makes PCRE validate the subject string. An example could be:

function isValid($string)
{
    return preg_match('//u', $string) === 1;
}

Every string should match an empty pattern, but with the 'u' flag the pattern only matches valid UTF-8 strings. However, unless you're using at least PHP 5.5.10, the validation is flawed as follows:

  • Prior to 5.5.10, it does not recognize 3 and 4 byte sequences as valid UTF-8. As this excludes most Unicode code points, it is a pretty major flaw.
  • Prior to 5.2.5, it also allows surrogates and code points beyond the allowed Unicode range (e.g. 5 and 6 byte characters).

Using the 'u' flag behavior does have one advantage though: It's the fastest of the discussed methods. If you need speed and you're running the latest and greatest PHP version, this validation method might be for you.
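
With the 'u' flag, a malformed subject makes preg_match() return false and sets preg_last_error() to PREG_BAD_UTF8_ERROR, which lets a caller tell bad UTF-8 apart from other PCRE failures. The following is a sketch under the version caveats listed above; the helper name utf8_check_u_flag() is made up for illustration.

```php
<?php
// Hypothetical helper built on the 'u' modifier; assumes PHP 5.5.10 or later.
function utf8_check_u_flag($string)
{
    $result = preg_match('//u', $string);

    if ($result === 1) {
        return true; // the empty pattern matched, so the subject passed PCRE's UTF-8 check
    }
    if ($result === false && preg_last_error() === PREG_BAD_UTF8_ERROR) {
        return false; // PCRE rejected the subject as malformed UTF-8
    }

    // Anything else is an engine problem, not a verdict on the encoding.
    throw new RuntimeException('preg_match() failed for a reason other than bad UTF-8');
}
```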

One additional way to validate UTF-8 is via json_encode(), which expects input strings to be in UTF-8. It does not work prior to 5.5.0, but from that version on, json_encode() returns false instead of a string when the input contains invalid sequences. For example:

function isValid($string)
{
    return json_encode($string) !== false;
}

I would not recommend relying on this behavior to last, however. Previous PHP versions simply produced an error on invalid sequences, so there is no guarantee that the current behavior is final.
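
If this approach is used anyway, checking json_last_error() against JSON_ERROR_UTF8 at least confirms that the failure was an encoding problem and not something else. A sketch assuming PHP 5.5.0 or later and default json_encode() flags; the helper name utf8_check_json() is hypothetical.

```php
<?php
// Hypothetical helper; assumes PHP 5.5.0+ and no JSON_* flags that alter error handling.
function utf8_check_json($string)
{
    if (json_encode($string) !== false) {
        return true; // the string could be encoded, so it was accepted as UTF-8
    }
    if (json_last_error() === JSON_ERROR_UTF8) {
        return false; // json_encode() reported malformed UTF-8
    }

    // Some other failure: do not report it as an encoding verdict.
    throw new RuntimeException('json_encode() failed: ' . json_last_error_msg());
}
```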

answered Oct 14 '22 04:10 by Riimu


You should be able to use iconv to check for validity. Just try and convert it to UTF-16 and see if you get an error.
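
A minimal sketch of that idea, assuming the iconv extension is available. The helper name utf8_check_iconv() is made up, and how strict the check is depends on the underlying iconv implementation (glibc or libiconv), so edge cases such as surrogates may be treated differently from the other methods.

```php
<?php
// Hypothetical helper; strictness depends on the iconv library PHP was built against.
function utf8_check_iconv($string)
{
    // On an illegal or incomplete input sequence iconv() raises a notice and
    // returns false; the @ operator silences the notice.
    return @iconv('UTF-8', 'UTF-16', $string) !== false;
}
```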

answered Oct 14 '22 05:10 by derobert