
str_word_count() for non-latin words?

Tags: php, count

I'm trying to count the number of words in a variable whose text is written in a non-Latin language (Bulgarian), but it seems that str_word_count() does not count non-Latin words. The PHP file is encoded in UTF-8.

$str = "текст на кирилица";
echo 'Number of words: '.str_word_count($str);
//this returns 0
mr.d asked Apr 11 '14 14:04



4 Answers

You can do it with a regex:

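$str = "текст на кирилица";
// PREG_SPLIT_NO_EMPTY drops the empty pieces produced by leading or trailing whitespace
echo 'Number of words: '.count(preg_split('/\s+/', $str, -1, PREG_SPLIT_NO_EMPTY));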

Here I'm defining the word delimiter as whitespace characters. If anything else should be treated as a word delimiter, you'll need to add it to the regex.

Also note that, since there are no multibyte (UTF-8) characters in the regex itself (as opposed to the string), the /u modifier isn't required. If you want some UTF-8 characters to act as delimiters, you'll need to add this modifier.
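To illustrate the case where /u becomes necessary, here is a minimal sketch that also treats a no-break space and an en dash as delimiters. The function name `count_words_custom_delimiters` and the choice of delimiters are assumptions made for this example, not part of the original answer. It relies on the `"\u{...}"` string escape, which needs PHP 7 or later.

```php
<?php
// Hypothetical helper: counts pieces separated by whitespace,
// no-break spaces (U+00A0) or en dashes (U+2013).
function count_words_custom_delimiters(string $text): int
{
    // Code points above 0xFF in \x{...} are only accepted with the /u modifier.
    $parts = preg_split('/[\s\x{00A0}\x{2013}]+/u', $text, -1, PREG_SPLIT_NO_EMPTY);

    // preg_split() returns false on error, e.g. when $text is not valid UTF-8.
    return $parts === false ? 0 : count($parts);
}

$str = "текст\u{00A0}на \u{2013} кирилица";
echo 'Number of words: ' . count_words_custom_delimiters($str) . "\n";
```

With /u the subject string must be valid UTF-8, which is why the sketch checks for a `false` return value instead of passing it straight to `count()`.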

Update:

If you want only Cyrillic letters to count as parts of words, you can use:

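$str = "текст 
на 12453
кирилица";
// /u makes the pattern UTF-8 aware; PREG_SPLIT_NO_EMPTY drops empty pieces at the edges
echo 'Number of words: '.count(preg_split('/[^А-Яа-яЁё]+/u', $str, -1, PREG_SPLIT_NO_EMPTY));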
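A related variation, offered only as a sketch, is to match the words themselves instead of splitting on what lies between them. The Unicode property `\p{L}` matches a letter in any script, so no alphabet has to be listed by hand. The function name `count_words_unicode` is made up for this example.

```php
<?php
// Hypothetical helper: counts runs of Unicode letters in any script.
function count_words_unicode(string $text): int
{
    // preg_match_all() returns the number of full matches, or false on error.
    $count = preg_match_all('/\p{L}+/u', $text);

    return $count === false ? 0 : $count;
}

echo count_words_unicode("текст \nна 12453\nкирилица") . "\n";            // 3
echo count_words_unicode('текст на кирилице and on english too') . "\n"; // 7
```

Digits are ignored here, and a word containing an apostrophe or a hyphen is counted as two runs of letters, so the pattern would need extending if that matters.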
Alma Do answered Oct 12 '22 14:10


And here is the solution that came to my mind:

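$var = "текст на кирилица с пет думи";
$array = explode(" ", $var);

$i = 0;
foreach ($array as $item) {
    // strlen() counts bytes, and each Cyrillic letter takes 2 bytes in UTF-8,
    // so this skips empty pieces and one-letter words such as "с"
    if (strlen($item) > 2) {
        $i++;
    }
}

echo $i; // prints 5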
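Because strlen() measures bytes, the threshold above behaves differently for Latin and Cyrillic words. A sketch of the same idea expressed in characters follows. It assumes the mbstring extension is available, and the function name `count_words_min_chars` and its `$minChars` parameter are invented for this example.

```php
<?php
// Hypothetical helper: counts whitespace-separated words that have
// at least $minChars characters (not bytes).
function count_words_min_chars(string $text, int $minChars = 2): int
{
    $count = 0;
    $items = preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);

    foreach ($items ?: [] as $item) {
        // mb_strlen() counts characters when told the encoding.
        if (mb_strlen($item, 'UTF-8') >= $minChars) {
            $count++;
        }
    }

    return $count;
}

$var = "текст на кирилица с пет думи";

var_dump(strlen('с'));              // int(2): bytes
var_dump(mb_strlen('с', 'UTF-8'));  // int(1): characters

echo count_words_min_chars($var) . "\n";    // 5, the one-letter word "с" is skipped
echo count_words_min_chars($var, 1) . "\n"; // 6, every word is counted
```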
mr.d answered Oct 12 '22 14:10


As stated in the str_word_count() description:

'word' is defined as a locale dependent string

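Set a Bulgarian locale before calling str_word_count(). Note that this can only help if that locale is installed on the server, and because str_word_count() inspects the string byte by byte, it may still fail for UTF-8 text: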

setlocale(LC_ALL, 'bg_BG');
echo str_word_count($content);

Read more about setlocale() in the PHP manual.
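Since setlocale() returns false when none of the requested locales is available, it is worth checking the result before trusting the count. The following is a sketch only: the locale names `bg_BG.UTF-8` and `bg_BG` are assumptions that depend on what the operating system has installed, and the printed count will vary with the locale and the string's encoding.

```php
<?php
// Remember the current setting; "0" asks setlocale() to report it without changing it.
$previous = setlocale(LC_ALL, '0');

// Try the candidates in order; the first locale that exists on the system is used.
$locale = setlocale(LC_ALL, 'bg_BG.UTF-8', 'bg_BG');

if ($locale === false) {
    echo "No Bulgarian locale is installed on this system\n";
} else {
    echo "Using locale {$locale}: " . str_word_count("текст на кирилица") . "\n";

    // Restore the original locale so that other code is not affected.
    setlocale(LC_ALL, $previous);
}
```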

debugger answered Oct 12 '22 12:10


The best solution I found is to pass a list of additional word characters to the word-count function:

$text = 'текст на кирилице and on english too';
$count = str_word_count($text, 0, 'АБВГДЕЁЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯабвгдеёжзийклмнопрстуфхцчшщъыьэюя');
echo $count; // => 7
Даниил Пронин answered Oct 12 '22 12:10