
is PHP str_word_count() multibyte safe?

Tags:

php

utf-8

utf

I want to use str_word_count() on a UTF-8 string.

Is this safe in PHP? It seems to me that it should be (especially considering that there is no mb_str_word_count()).

But on php.net there are a lot of people muddying the water by presenting their own 'multibyte compatible' versions of the function.

So I guess I want to know...

  1. Given that str_word_count simply counts all character sequences delimited by " " (space), it should be safe on multibyte strings, even though it's not necessarily aware of the character sequences, right?

  2. Are there any equivalent 'space' characters in UTF-8 which are not ASCII " " (space)?

This is where the problem might lie I guess.

asked Nov 28 '11 01:11 by carpii


2 Answers

I'd say you guess right. And indeed there are space characters in UTF-8 which are not part of US-ASCII. To give you an example of such spaces:

  • Unicode Character 'NO-BREAK SPACE' (U+00A0): 2 Bytes in UTF-8: 0xC2 0xA0 (c2a0)

And perhaps as well:

  • Unicode Character 'NEXT LINE (NEL)' (U+0085): 2 Bytes in UTF-8: 0xC2 0x85 (c285)
  • Unicode Character 'LINE SEPARATOR' (U+2028): 3 Bytes in UTF-8: 0xE2 0x80 0xA8 (e280a8)
  • Unicode Character 'PARAGRAPH SEPARATOR' (U+2029): 3 Bytes in UTF-8: 0xE2 0x80 0xA9 (e280a9)
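
The byte sequences listed above can be checked directly. The following is a minimal sketch, assuming PHP 7.0 or later (for the "\u{...}" string escape); the array and variable names are made up for illustration.

```php
<?php
// Code point => name, as listed in the answer above.
$spaces = [
    "\u{00A0}" => 'NO-BREAK SPACE',
    "\u{0085}" => 'NEXT LINE (NEL)',
    "\u{2028}" => 'LINE SEPARATOR',
    "\u{2029}" => 'PARAGRAPH SEPARATOR',
];

$hex = [];
foreach ($spaces as $char => $name) {
    // bin2hex() shows the raw bytes PHP stores for the UTF-8 encoded character.
    $hex[$name] = bin2hex($char);
    echo $name, ': ', $hex[$name], "\n";
}
```

None of these byte sequences contains the ASCII space 0x20, so a function that only knows about ASCII whitespace has no way to recognise them.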

Anyway, the first one, the 'NO-BREAK SPACE' (U+00A0), is a good example as it is also part of the Latin-X charsets. And the PHP manual already hints that str_word_count is locale dependent.

If we want to put this to a test, we can set the locale to UTF-8 and pass in a string containing a lone \xA0 byte, which is invalid UTF-8 but is the NO-BREAK SPACE in the Latin-X charsets. If that byte still counts as a word-breaking character, the function is clearly not UTF-8 safe, hence not multibyte safe (taking "multibyte safe" as loosely as the question does):

<?php
/**
 * is PHP str_word_count() multibyte safe?
 * @link https://stackoverflow.com/q/8290537/367456
 */

echo 'New Locale: ', setlocale(LC_ALL, 'en_US.utf8'), "\n\n";

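// A lone \xA0 byte: NO-BREAK SPACE in Latin-1, but not a valid UTF-8 sequence.
$test   = "aword\xA0bword aword";
// Format 2 returns an array of words keyed by their byte offset in $test.
$result = str_word_count($test, 2);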

var_dump($result);

Output:

New Locale: en_US.utf8

array(3) {
  [0]=>
  string(5) "aword"
  [6]=>
  string(5) "bword"
  [12]=>
  string(5) "aword"
}

As this demo shows, the function fails on the locale promise the manual page makes. (I neither wonder nor moan about this: most often, if you read that a function is locale specific in PHP, run for your life and find one that is not.) I exploit that here to demonstrate that it does nothing at all regarding the UTF-8 character encoding.
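
To confirm that the test string really is malformed UTF-8, one option is to run it through PCRE with the u modifier. This is only a sketch: the helper name is_valid_utf8() is hypothetical, and the second string is added here for comparison, it is not part of the demo above.

```php
<?php
// Hypothetical helper: true when $s is well-formed UTF-8.
function is_valid_utf8(string $s): bool
{
    // With the u modifier, preg_match() returns false (not 0) on malformed
    // UTF-8; the empty pattern matches any well-formed subject, returning 1.
    return preg_match('//u', $s) === 1;
}

$latin1Style = "aword\xA0bword aword";     // lone 0xA0 byte, as in the test above
$utf8Style   = "aword\xC2\xA0bword aword"; // U+00A0 encoded as UTF-8

var_dump(is_valid_utf8($latin1Style)); // bool(false)
var_dump(is_valid_utf8($utf8Style));   // bool(true)
```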

Instead, for UTF-8 you should take a look at the PCRE extension:

  • Matching Unicode letter characters in PCRE/PHP

PCRE has a good understanding of Unicode and UTF-8, in PHP specifically. It can also be quite fast if you craft the regular expression pattern carefully.
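
As a rough illustration of the PCRE route, the sketch below counts runs of Unicode letters. The function name count_letter_runs() is hypothetical, and treating "a run of \p{L}" as a word is an assumption made for this example; it is not a drop-in replacement for str_word_count(). It assumes PHP 7.0 or later.

```php
<?php
// Hypothetical helper, shown only to illustrate \p{L} with the u modifier.
function count_letter_runs(string $text): int
{
    // \p{L}+ matches a run of Unicode letters; the u modifier makes PCRE
    // treat both the pattern and the subject as UTF-8.
    $count = preg_match_all('/\p{L}+/u', $text);

    if ($count === false) {
        throw new InvalidArgumentException('PCRE failed; the input may not be valid UTF-8.');
    }
    return $count;
}

// "naïve café", then a NO-BREAK SPACE (U+00A0), then "über".
echo count_letter_runs("na\u{00EF}ve caf\u{00E9}\u{00A0}\u{00FC}ber"), "\n"; // 3
```

The NO-BREAK SPACE is not a letter, so it separates the second and third word just like an ASCII space would.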

answered Sep 21 '22 16:09 by hakre


About the "template answer": I don't get the demand for "working faster". We're not talking about long run times or a lot of counts here, so who cares whether it takes a few milliseconds longer?

However, here is a str_word_count that works with the soft hyphen:

function my_word_count($str) {
  // remove the soft hyphen (U+00AD, encoded in UTF-8) before counting
  return str_word_count(str_replace("\xC2\xAD",'', $str));
}

And a function that complies with the asserts (but is probably not faster than str_word_count):

function my_word_count($str) {
  $mystr = str_replace("\xC2\xAD",'', $str);        // soft hyphen encoded in UTF-8
  return preg_match_all('~[\p{L}\'\-]+~u', $mystr); // regex expecting UTF-8
}

The preg function is essentially the same as what's already been proposed, except that a) it returns the count directly, so there is no need to supply matches, which should make it faster, and b) there really should not be an iconv fallback, IMO.


About a comment:

I can see that your PCRE functions are worse (performance) than my preg_word_count() because they need a str_replace that you do not need: '~[^\p{L}\'-\xC2\xAD]+~u' works fine (!).

I considered that a different thing: the string replace will only remove the multibyte character, but your regex will deal with \xC2 and \xAD in any order they might appear, which is wrong. Consider the registered sign, which is \xC2\xAE.

However, now that I think about it, due to the way valid UTF-8 works it wouldn't really matter, so that should be equally usable. So we can just have the function

function my_word_count($str) {
  return preg_match_all('~[\p{L}\'\-\xC2\xAD]+~u', $str); // regex expecting UTF-8
}

without any need for matches or other replacements.
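
A quick way to try this out is sketched below. It is a hypothetical variant, not the function above: the name count_words_utf8() is made up, and the soft hyphen is written as \x{00AD}, which with the u modifier refers to the code point U+00AD rather than to a raw byte. It assumes PHP 7.0 or later.

```php
<?php
// Hypothetical variant of the function above; name and \x{00AD} spelling are assumptions.
function count_words_utf8(string $str): int
{
    // A word is a run of letters, apostrophes, hyphens and SOFT HYPHEN (U+00AD).
    $count = preg_match_all('~[\p{L}\'\-\x{00AD}]+~u', $str);

    // preg_match_all() returns false on error, e.g. for malformed UTF-8 input.
    return $count === false ? 0 : $count;
}

echo count_words_utf8("co\u{00AD}operate don't well-known"), "\n"; // 3
echo count_words_utf8("a \u{00AE} b"), "\n";                       // 2
```

The second call shows that the registered sign (U+00AE, bytes \xC2\xAE) is not swallowed into a word, even though it shares its first byte with the soft hyphen.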

About str_word_count(str_replace("\xC2\xAD",'', $str));: if it is stable with UTF-8, that is good, but it seems it is not.

If you read this thread, you'll know str_replace is safe if you stick to valid UTF-8 strings. I didn't see any evidence to the contrary in your link.
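
The point about str_replace can be illustrated with a small sketch (variable names are made up). In valid UTF-8 the byte pair C2 AD can only occur as the encoding of U+00AD, so a byte-wise replace cannot hit part of another character such as the registered sign.

```php
<?php
$softHyphen = "\xC2\xAD"; // U+00AD SOFT HYPHEN in UTF-8
$registered = "\xC2\xAE"; // U+00AE REGISTERED SIGN, shares the lead byte 0xC2

$input  = "co" . $softHyphen . "op " . $registered;
$output = str_replace($softHyphen, '', $input);

// Only the soft hyphen is gone; the registered sign keeps both of its bytes.
echo bin2hex($output), "\n"; // 636f6f7020c2ae
```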

answered Sep 18 '22 16:09 by eis