
String corrupted or preg_match bug?

Tags:

php

utf-8

The NO-BREAK SPACE and many other non-ASCII characters need 2 bytes in UTF-8 (NBSP is \xC2\xA0); so, where strings are supposed to be UTF-8, an isolated non-ASCII byte (>127, here \xA0 not preceded by \xC2) is not a recognized character... OK, it looks like a mere layout problem (!), but does it corrupt the whole string?

How can this unexpected behaviour be avoided? (It occurs in some functions and not in others.)

Example (producing the unexpected behaviour with preg_match_all only):

  header("Content-Type: text/plain; charset=utf-8"); // same if text/html
  //PHP Version 5.5.4-1+debphp.org~precise+1
  //using a .php file encoded as UTF-8.

  $s = "THE UTF-8 NO-BREAK\xA0SPACE"; // a non-ASCII byte
  preg_match_all('/[-\'\p{L}]+/u',$s,$m);
  var_dump($m);            // empty! (corrupted)
  $m=str_word_count($s,1);
  var_dump($m);            // ok

  $s = "THE UTF-8 NO-BREAK\xC2\xA0SPACE";  // utf8-encoded nbsp
  preg_match_all('/[-\'\p{L}]+/u',$s,$m);
  var_dump($m);            // ok!
  $m=str_word_count($s,1);
  var_dump($m);            // ok
Peter Krauss, asked Oct 11 '13 10:10


2 Answers

This is not a complete answer, because I do not explain why some PHP functions "fail entirely on invalidly encoded strings" and others do not: see @deceze in the question's comments and @hakre's answer. If you are looking for a PCRE-based replacement for str_word_count(), see my preg_word_count() below.

PS: about the "uniformity of PHP5's built-in library behaviour" discussion, my conclusion is that PHP5 is not so bad, but we have to create a lot of user-defined wrapper (façade) functions (see the diversity of PHP frameworks!)... Or wait for PHP6 :-)


Thanks @pebbl! If I understand your link, there is a lack of error messages in PHP. So a possible workaround for the problem I illustrated is to add an error condition... I found the condition here (it ensures valid UTF-8!)... And thanks @deceze for reminding me that there is a built-in function to check this condition (I edited the code afterwards).
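For reference, such an up-front check could look like the sketch below. It assumes the mbstring extension is installed, and the helper name is_valid_utf8() is made up for illustration (it is not part of PHP or of the original answer):

```php
<?php
// Hypothetical helper: reports whether $s is well-formed UTF-8.
// Assumes the mbstring extension is available.
function is_valid_utf8($s) {
    return mb_check_encoding($s, 'UTF-8');
}

var_dump(is_valid_utf8("NO-BREAK\xA0SPACE"));     // bool(false): lone 0xA0 byte
var_dump(is_valid_utf8("NO-BREAK\xC2\xA0SPACE")); // bool(true): UTF-8 encoded NBSP
```

A check like this runs before the regex is applied; the alternative is to call the preg function first and inspect preg_last_error() afterwards.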

Putting the issues together, here is a solution written as a function (EDITED, thanks to @hakre's comments!):

 function my_word_count($s,$triggError=true) {
   if ( preg_match_all('/[-\'\p{L}]+/u',$s,$m) !== false )
      return count($m[0]);
   else {
      if ($triggError) trigger_error(
         // no need for mb_check_encoding($s,'UTF-8'), see hakre's answer;
         // so I was wrong, there is no 'mysterious error' with preg functions
         (preg_last_error()==PREG_BAD_UTF8_ERROR)? 
              'non-UTF8 input!': 'other error',
         E_USER_NOTICE
         );
      return NULL;
   }
 }

Now (edited after thinking about @hakre's answer), about uniform behaviour: we can develop a reasonable function with the PCRE library that mimics the str_word_count behaviour, accepting bad UTF-8. For this task I used @bobince's iconv tip:

 /**
  * Like str_word_count() but showing how preg can do the same.
  * This function is more flexible than str_word_count, but not faster.
  * @param $s the input string (expected to be UTF-8).
  * @param $wRgx the "word regular expression" as defined by the user.
  * @param $triggError when true, a failure also triggers an E_USER_NOTICE.
  * @param $OnBadUtfTryAgain when true, retry bad UTF-8 input as CP1252
  *        (mimics the str_word_count behaviour).
  * @return int word count (0 or positive), or the negated PCRE error code.
  */
 function preg_word_count($s,$wRgx='/[-\'\p{L}]+/u', $triggError=true,
                          $OnBadUtfTryAgain=true) {
   if ( preg_match_all($wRgx,$s,$m) !== false )
      return count($m[0]);
   else {
      $lastError = preg_last_error();
      $chkUtf8 = ($lastError==PREG_BAD_UTF8_ERROR);
      if ($OnBadUtfTryAgain && $chkUtf8) 
         return preg_word_count(
            iconv('CP1252','UTF-8',$s), $wRgx, $triggError, false
         );
      elseif ($triggError) trigger_error(
         $chkUtf8? 'non-UTF8 input!': "error PCRE_code-$lastError",
         E_USER_NOTICE
         );
      return -$lastError;
   }
 }

Demonstrating (try other inputs!):

 $s = "THE UTF-8 NO-BREAK\xA0SPACE"; // a non-ASCII byte
 print "\n-- str_word_count=".str_word_count($s,0);
 print "\n-- preg_word_count=".preg_word_count($s);

 $s = "THE UTF-8 NO-BREAK\xC2\xA0SPACE";  // utf8-encoded nbsp
 print "\n-- str_word_count=".str_word_count($s,0);
 print "\n-- preg_word_count=".preg_word_count($s);
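The iconv() retry above assumes that stray bytes are CP1252. When that guess is not wanted, one alternative is to substitute the invalid bytes instead of converting them. A minimal sketch, assuming PHP 7.2+ with the mbstring extension (for mb_scrub()); the function name scrubbed_word_count() is hypothetical:

```php
<?php
// Hypothetical variant: instead of guessing a source encoding, replace
// ill-formed byte sequences with mbstring's substitute character and
// then count words with the same default pattern as preg_word_count().
function scrubbed_word_count($s, $wRgx = '/[-\'\p{L}]+/u') {
    $clean = mb_scrub($s, 'UTF-8');   // requires PHP >= 7.2 and mbstring
    $n = preg_match_all($wRgx, $clean, $m);
    // Same convention as above: negated PCRE error code on failure.
    return ($n === false) ? -preg_last_error() : $n;
}

print "\n-- scrubbed=" . scrubbed_word_count("THE UTF-8 NO-BREAK\xA0SPACE");
print "\n-- scrubbed=" . scrubbed_word_count("THE UTF-8 NO-BREAK\xC2\xA0SPACE");
```

The trade-off: the original meaning of the bad byte is lost, whereas the CP1252 conversion keeps it as a real character when the guess is right.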
Peter Krauss, answered Oct 27 '22 12:10


Okay, I can somewhat feel your disappointment that things didn't work out easily when switching from str_word_count to preg_match_all. However, the way you ask the question is a bit imprecise; I'll try to answer it anyway. Imprecise, because you make quite a number of wrong assumptions that you obviously take for granted (it happens to the best of us). I hope I can correct this a little:

$s = "THE UTF-8 NO-BREAK\xA0SPACE"; // a non-ASCII byte
preg_match_all('/[-\'\p{L}]+/u',$s,$m);
var_dump($m);            // empty! (corrupted)

This code is wrong. You blame PHP here for not giving a warning or something, but I must say, the only one to blame here is "you". PHP does allow you to check for the error. Before you conclude that error handling has to mean a warning, let me remind you that there are different ways to deal with errors: some functions emit messages, others report errors through their return values. And if we visit the manual page of preg_match_all and look at the documentation of the return value, we find this:

Returns the number of full pattern matches (which might be zero), or FALSE if an error occurred.

The part at the end:

FALSE if an error occurred [Highlight by me]

is a common way in error handling to signal to the calling code that an error occurred. Let's review the code that you think does not work:

$s = "THE UTF-8 NO-BREAK\xA0SPACE"; // a non-ASCII byte
preg_match_all('/[-\'\p{L}]+/u',$s,$m);
var_dump($m);            // empty! (corrupted)

The only thing this code shows is that the person who typed it (I guess it was you) clearly decided not to do any error handling. That's fine, unless that person then also protests that the code doesn't work.

The sad thing is that this is a common user error: if you write fragile code (e.g. without error handling), don't expect it to work in a solid manner. That will never happen.

So what does this require when you program? First of all, you should know the functions you use. That normally means knowing the input parameters and the return values, and you normally find that information documented. Use the manual. Second, you actually need to care about return values and do the error handling yourself. The function alone does not know what it means if an error occurred. Is it an exception? Then you probably need to do the exception handling, as in this demo example:

<?php
/**
 * @link http://stackoverflow.com/q/19316127/367456
 */

$s = "THE UTF-8 NO-BREAK\xA0SPACE"; // a non-ASCII byte
$result = preg_match_all('/[-\'\p{L}]+/u',$s,$m);

if ($result === FALSE) {
    switch (preg_last_error()) {
        case PREG_BAD_UTF8_ERROR:
            throw new InvalidArgumentException(
                'UTF-8 encoded binary string expected.'
            );
        default:
            throw new RuntimeException('preg error occurred.');

    }
}

var_dump($m);            // nothing at all corrupted...

In any case, it means you need to look at what you do, learn about it and write more code. No magic. No bug. Just a bit of work.
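On newer PHP versions the same pattern (check the return value, turn FALSE into an exception) can be wrapped once and reused. A sketch, assuming PHP 8.0 or later, where preg_last_error_msg() is available; the wrapper name preg_match_all_or_throw() is hypothetical:

```php
<?php
// Hypothetical wrapper: returns the full-pattern matches, or throws when
// preg_match_all() signals an error through its FALSE return value.
// preg_last_error_msg() requires PHP >= 8.0.
function preg_match_all_or_throw($pattern, $subject) {
    $result = preg_match_all($pattern, $subject, $m);
    if ($result === false) {
        throw new RuntimeException(
            'preg_match_all failed: ' . preg_last_error_msg(),
            preg_last_error()   // PCRE error code as the exception code
        );
    }
    return $m[0];
}

try {
    preg_match_all_or_throw('/[-\'\p{L}]+/u', "THE UTF-8 NO-BREAK\xA0SPACE");
} catch (RuntimeException $e) {
    echo $e->getMessage(), "\n";   // reports the PCRE error text
}
```

Whether to throw, log or fall back is still the caller's decision; the wrapper only makes sure the error cannot be ignored silently.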

The other part ahead of you is perhaps to understand what characters are in software, but that is largely independent of any concrete programming language like PHP. For an introductory read, see:

  • A tutorial on character code issues
  • The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

The first is a must-read, or perhaps a must-bookmark: it is a lot to read, but it explains it all very well.

hakre, answered Oct 27 '22 12:10