
A PHP Library / Class to Count Words in Various Languages?

Some time in the near future I will need to implement a cross-language word count, or if that is not possible, a cross-language character count.

By word count I mean an accurate count of the words contained within the given text, taking the language of the text into account. The language of the text is set by a user, and will be assumed to be correct.

By character count I mean a count of the "possibly in a word" characters contained within the given text, with the same language information described above.

I would much prefer the former count, but I am aware of the difficulties involved. I am also aware that the latter count is much easier, but very much prefer the former, if at all possible.

I'd love it if I just had to look at English, but I need to consider every language here: Chinese, Korean, English, Arabic, Hindi, and so on.

I would like to know if Stack Overflow has any leads on where to start looking for an existing product / method to do this in PHP, as I am a good lazy programmer.*

A simple test showing how str_word_count with setlocale doesn't work, and a function from php.net's str_word_count page.

*http://blogoscoped.com/archive/2005-08-24-n14.html

Michael Robinson asked May 29 '10 15:05



1 Answer

Counting chars is easy:

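echo strlen('一个有十的字符的句子'); // 30 (WRONG! strlen() counts bytes, not characters)
echo strlen(utf8_decode('一个有十的字符的句子')); // 10 (utf8_decode() is deprecated as of PHP 8.2)
echo mb_strlen('一个有十的字符的句子', 'UTF-8'); // 10 (needs the mbstring extension)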
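
For the "possibly in a word" character count described in the question, one option is to count only letters and digits instead of every character. This is a minimal sketch, not a definitive implementation: the name countWordChars is made up for illustration, the input is assumed to be valid UTF-8, and treating the Unicode categories L (letters) and N (numbers) as "word characters" is an assumption. Combining marks (\p{M}) are deliberately left out here, which may undercount scripts such as Hindi.

```php
<?php
// Hypothetical helper: counts code points that are letters or numbers.
// Assumes $text is valid UTF-8; whitespace, punctuation and symbols are ignored.
function countWordChars(string $text): int
{
    // With the /u modifier, each match of the character class is one code point.
    $n = preg_match_all('/[\p{L}\p{N}]/u', $text);

    // preg_match_all() returns false on error (for example malformed UTF-8).
    return $n === false ? 0 : $n;
}

echo countWordChars('Hello, world!'), "\n";        // 10
echo countWordChars('一个有十的字符的句子'), "\n"; // 10
```

Adding \p{M} to the character class would also count combining marks, if that fits your definition better.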

Counting words is where things start to get tricky, especially for Chinese, Japanese and other languages that don't use spaces (or other common "word boundary" characters) as word separators. I don't speak Chinese and I don't understand how word counting works in Chinese, so you'll have to educate me a bit: what makes a word in these languages? Is it any specific char or set of chars? I remember reading something about how hard it was to identify Japanese words in T9 writing, but I can't find it anymore.
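
One possible lead for languages written without spaces is dictionary-based segmentation, which the ICU library provides and PHP exposes through the intl extension's IntlBreakIterator. The sketch below is not part of the original answer. It assumes ext-intl is installed, countWordsIntl is a made-up name, and the counts for Chinese, Japanese or Thai depend on the dictionary shipped with your ICU version, so treat them as approximate.

```php
<?php
// Hypothetical helper; requires the intl extension (ICU).
function countWordsIntl(string $text, string $locale): int
{
    $it = IntlBreakIterator::createWordInstance($locale);
    $it->setText($text);

    $count = 0;
    $it->first();
    // next() moves to the following boundary, or returns DONE at the end.
    while ($it->next() !== IntlBreakIterator::DONE) {
        // The rule status describes the segment that ends at this boundary.
        // Values below WORD_NONE_LIMIT are spaces and punctuation.
        if ($it->getRuleStatus() >= IntlBreakIterator::WORD_NONE_LIMIT) {
            $count++;
        }
    }
    return $count;
}

if (extension_loaded('intl')) {
    echo countWordsIntl('Hello, world!', 'en'), "\n"; // 2
    echo countWordsIntl('一个有十的字符的句子', 'zh'), "\n"; // varies with the ICU dictionary
}
```

Whether ICU's idea of a "word" matches what a native speaker would count is something to check against sample texts for each language you care about.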

The following should correctly return the number of words in languages that use spaces or punctuation chars as word separators:

count(preg_split('~[\p{Z}\p{P}]+~u', $string, -1, PREG_SPLIT_NO_EMPTY)); // -1 = no limit; passing null here is deprecated as of PHP 8.1
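
Wrapped in a function for reuse, the one-liner might look like the following. It is only a sketch: countWordsBySeparators is a made-up name and the input is assumed to be valid UTF-8. Two things to keep in mind: \p{P} includes the apostrophe, so a contraction like "don't" counts as two words, and a text with no separators at all (such as a Chinese sentence) comes back as a single word.

```php
<?php
// Hypothetical wrapper around the expression above.
function countWordsBySeparators(string $string): int
{
    // Split on runs of Unicode separators (\p{Z}) and punctuation (\p{P}),
    // dropping the empty pieces at the start or end of the string.
    $parts = preg_split('~[\p{Z}\p{P}]+~u', $string, -1, PREG_SPLIT_NO_EMPTY);

    // preg_split() returns false on error (for example malformed UTF-8).
    return $parts === false ? 0 : count($parts);
}

echo countWordsBySeparators('The quick brown fox.'), "\n"; // 4
echo countWordsBySeparators("I don't know"), "\n";         // 4 ("don" and "t" are split)
echo countWordsBySeparators('一个有十的字符的句子'), "\n";  // 1 (no separators to split on)
```

Tabs and newlines are control characters rather than members of \p{Z}, so if your text contains them you may want to add \s to the character class.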
Alix Axel answered Oct 17 '22 06:10
