
Unicode (UTF8) string word count in PHP

I need the word count of the following Unicode string. Using str_word_count:

$input = 'Hello, chào buổi sáng'; 
$count = str_word_count($input);
echo $count;

the result is

7

which is apparently wrong.

How can I get the desired result (4)?

Hai Truong IT asked Apr 23 '26 17:04



2 Answers

$tags = 'Hello, chào buổi sáng';
// Split on single spaces; each piece counts as one word.
$words = explode(' ', $tags);
echo count($words); // 4

Here's a demo: http://codepad.org/667Cr1pQ
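
One caveat: splitting on a single space counts empty pieces when the text has consecutive, leading, or trailing spaces, and it ignores tabs and newlines. A minimal sketch that splits on runs of whitespace instead, assuming a hypothetical helper name count_words_by_whitespace (not part of the answer above):

```php
<?php
// Hypothetical helper: splits on runs of whitespace instead of a single space.
function count_words_by_whitespace(string $text): int
{
    // PREG_SPLIT_NO_EMPTY drops the empty pieces that leading or trailing
    // whitespace would otherwise produce.
    $words = preg_split('/\s+/u', $text, -1, PREG_SPLIT_NO_EMPTY);

    // preg_split returns false on failure (for example, invalid UTF-8).
    return $words === false ? 0 : count($words);
}

echo count_words_by_whitespace("Hello,  chào\tbuổi sáng "), "\n"; // 4
```

Like the explode version, this treats punctuation attached to a word ("Hello,") as part of that word.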

Joseph Silber answered Apr 26 '26 05:04



Here is a quick-and-dirty, Unicode-aware, regex-based word-counting function:

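function mb_count_words($string) {
    // Match runs of letters (\pL), numbers (\pN), and dashes (\p{Pd}).
    // Two-letter properties need braces: \pPd is read as \pP (any punctuation) plus a literal "d".
    preg_match_all('/[\pL\pN\p{Pd}]+/u', $string, $matches);
    return count($matches[0]);
}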

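A "word" is any run of one or more of the following characters:

  • Any letter (Unicode category L)
  • Any number character (Unicode category N), which includes digits
  • Any hyphen or dash (Unicode category Pd)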

This would mean that the following contains 5 "words" (4 normal, 1 hyphenated):

echo mb_count_words('Hello, chào buổi sáng, chào-sáng'); // 5

This function is not well suited to very large texts, though it should handle most of what counts as a block of text on the internet. The reason is that preg_match_all has to build and populate a large array of matches, which is thrown away as soon as it has been counted. A more memory-efficient approach would be to walk through the text character by character, identify runs of Unicode whitespace, and increment a counter at each word boundary. That is not difficult, but it is tedious to write.
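
As a rough illustration of counting without the big array, here is a minimal sketch. It is not the character-by-character scan described above; instead it finds one word at a time with preg_match and a byte offset, so only the current match is held in memory. The name mb_count_words_iter is hypothetical, and the pattern adds \pM (combining marks) as an assumption so that decomposed accents do not split a word.

```php
<?php
// Hypothetical helper (not from the answer above): counts words one match at
// a time, so an array of all matches is never built.
function mb_count_words_iter(string $string): int
{
    $count = 0;
    $offset = 0; // byte offset into $string, always on a character boundary

    // Letters, combining marks, numbers, and dashes form a "word".
    $pattern = '/[\pL\pM\pN\p{Pd}]+/u';

    while (preg_match($pattern, $string, $m, PREG_OFFSET_CAPTURE, $offset) === 1) {
        $count++;
        // $m[0] is [matched text, byte offset]; resume right after the match.
        $offset = $m[0][1] + strlen($m[0][0]);
    }

    return $count;
}

echo mb_count_words_iter('Hello, chào buổi sáng, chào-sáng'), "\n"; // 5
```

Note that preg_match returns false on invalid UTF-8, in which case the loop stops and the count reached so far is returned.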

Sverri M. Olsen answered Apr 26 '26 07:04



