str_word_count() function doesn't display Arabic language properly

Tags: function, php

I wrote the following function to return the first N words of a text:

function brief_text($text, $num_words = 50) {
    $words = str_word_count($text, 1);
    $required_words = array_slice($words, 0, $num_words);
    return implode(" ", $required_words);
}

It works well with English, but with Arabic text it fails and does not return the expected words. For example:

$text_en = "Cairo is the capital of Egypt and Paris is the capital of France";
echo brief_text($text_en, 10);

will output "Cairo is the capital of Egypt and Paris is the", while

$text_ar = "القاهرة هى عاصمة مصر وباريس هى عاصمة فرنسا";
echo brief_text($text_ar, 10); 

will output � � � � � � � � � �.

I know that the problem is with the str_word_count function but I don't know how to fix it.

UPDATE

I have already written another function that works well with both English and Arabic, but I am still looking for a fix for the problem that str_word_count() causes with Arabic text. Anyway, here is my other function:

    function brief_text($string, $number_of_required_words = 50) {
        $string = trim(preg_replace('/\s+/', ' ', $string));
        $words = explode(" ", $string);
        $required_words = array_slice($words, 0, $number_of_required_words); // take the first N elements of the array
        return implode(" ", $required_words);
    }
asked Dec 14 '12 18:12 by Amr


3 Answers

Try this function for counting words:

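// You can name the function however you like
if (!function_exists('mb_str_word_count'))
{
    // $charlist is kept only to mirror str_word_count()'s signature; it is not used.
    function mb_str_word_count($string, $format = 0, $charlist = '[]') {
        mb_internal_encoding('UTF-8');
        mb_regex_encoding('UTF-8');

        // Split on runs of characters outside the Arabic block (U+0600 to U+06FF)
        // and drop the empty strings left by leading or trailing separators.
        $words = array_values(array_filter(mb_split('[^\x{0600}-\x{06FF}]+', $string), 'strlen'));
        switch ($format) {
            case 0:
                return count($words);
            case 1:
            case 2:
            default:
                // Unlike str_word_count(), format 2 is not keyed by word position.
                return $words;
        }
    }
}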



echo mb_str_word_count("القاهرة هى عاصمة مصر وباريس هى عاصمة فرنسا") . PHP_EOL;

Resources

  • Unicode list for arabic
  • A Rule-Based Arabic Stemming Algorithm
  • A Rule and Template Based Stemming Algorithm for Arabic Language (seems more complete)

Recommendations

  • Use the tag <meta charset="UTF-8"/> in HTML files
  • Always add Content-type: text/html; charset=utf-8 headers when serving pages
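
In plain PHP, the second recommendation could be applied roughly as follows. This is a minimal sketch, assuming a page served without a framework; the headers_sent() guard is an addition for illustration and not part of the original answer.

```php
<?php
// header() only works before any output has been sent to the client.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=utf-8');
}

// Make the mbstring functions default to UTF-8 as well.
mb_internal_encoding('UTF-8');
```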
answered Nov 20 '22 02:11 by rkmax


To accept ASCII characters as well, split on anything that is not a Unicode letter, digit, or apostrophe:

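if (!function_exists('mb_str_word_count'))
{
    // $charlist is kept only to mirror str_word_count()'s signature; it is not used.
    function mb_str_word_count($string, $format = 0, $charlist = '[]') {
        // Split on runs of anything that is not a letter, a digit, or an apostrophe.
        // PREG_SPLIT_NO_EMPTY drops empty pieces, so '' and '(word)' are handled.
        $words = preg_split('~[^\p{L}\p{N}\']+~u', trim($string), -1, PREG_SPLIT_NO_EMPTY);
        if ($words === false) {
            $words = array(); // preg_split() failed, e.g. on malformed UTF-8
        }
        switch ($format) {
            case 0:
                return count($words);
            case 1:
            case 2:
            default:
                // Formats 1 and 2 both return a plain list of words.
                return $words;
        }
    }
}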
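
To tie this back to the question, here is a minimal sketch of brief_text() built on the same kind of Unicode-aware split, assuming the input is valid UTF-8. The name brief_text_utf8 is hypothetical. The pattern also adds \p{M} (combining marks), on the assumption that vocalized Arabic should stay in one piece: Arabic diacritics are marks rather than letters, so \p{L} alone would split a word at each diacritic.

```php
<?php
// Hypothetical helper, not from the original answer: returns the first
// $num_words words of a UTF-8 string, joined by single spaces.
function brief_text_utf8(string $text, int $num_words = 50): string
{
    // \p{L} = letters, \p{M} = combining marks, \p{N} = numbers.
    // The u modifier makes PCRE treat pattern and subject as UTF-8.
    $words = preg_split('~[^\p{L}\p{M}\p{N}\']+~u', $text, -1, PREG_SPLIT_NO_EMPTY);
    if ($words === false) {
        // preg_split() returns false on failure, e.g. for malformed UTF-8.
        return '';
    }
    return implode(' ', array_slice($words, 0, $num_words));
}

echo brief_text_utf8("القاهرة هى عاصمة مصر وباريس هى عاصمة فرنسا", 3), PHP_EOL;
// prints: القاهرة هى عاصمة
```

Note that, like the function above, this discards punctuation between words, so the result is a list of words rather than an exact prefix of the original text.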
answered Nov 20 '22 04:11 by ahoo


If you want to count the words in Farsi or Arabic text, you can use the code below:

public function customWordCount($content_text)
{
    $resultArray = explode(' ',trim($content_text));
    foreach ($resultArray as $key => $item)
    {
        if (in_array($item,["|",";",".","-","=",":","{","}","[","]","(",")"]))
        {
            $resultArray[$key] = '';
        }
    }

    $resultArray = array_filter($resultArray);
    return count($resultArray);
}
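
The method above needs a surrounding class to run. A standalone variant might look like the sketch below; the name custom_word_count is hypothetical. It makes two deliberate changes: it splits on runs of any whitespace rather than single spaces, and it filters explicitly instead of relying on array_filter() without a callback, which would also discard a token such as "0".

```php
<?php
// Hypothetical standalone variant of customWordCount().
function custom_word_count(string $text): int
{
    // Tokens that consist of a single punctuation mark are not counted.
    $ignored = ['|', ';', '.', '-', '=', ':', '{', '}', '[', ']', '(', ')'];

    // Split on runs of whitespace so repeated spaces, tabs and newlines
    // do not produce empty tokens.
    $tokens = preg_split('/\s+/u', trim($text), -1, PREG_SPLIT_NO_EMPTY);
    if ($tokens === false) {
        return 0; // preg_split() failed, e.g. on malformed UTF-8
    }

    // Keep every token that is not a lone punctuation mark.
    $words = array_filter($tokens, function ($token) use ($ignored) {
        return !in_array($token, $ignored, true);
    });

    return count($words);
}
```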
answered Nov 20 '22 03:11 by Alireza Salehi