PHP String Function with non-English languages

Tags: php, utf-8

I was trying the range() function with a non-English language, and it is not working.

$i = 0;
$alphabets = array();
foreach (range('क', 'म') as $ab) {
    ++$i;
    $alphabets[$ab] = $i;
}

Output: à =1

These are letters of the Hindi (India) alphabet. The loop iterates only once, as the output shows.

I cannot figure out what to do about this.

So, if possible, please tell me how to handle this, and what I should do first before working with non-English text in PHP functions.

Satya Prakash asked Oct 22 '11 14:10


2 Answers

Short answer: it's not possible to use range like that.

Explanation

You are passing the string 'क' as the start of the range and 'म' as the end. You are getting only one character back, and that character is à.

You are getting back à because your source file is encoded (saved) in UTF-8. One can tell because à is code point U+00E0, and 0xE0 is also the first byte of the UTF-8 encoded form of 'क' (which is 0xE0 0xA4 0x95). PHP strings are plain byte sequences with no notion of encoding, so range() just takes the first byte it sees in the string and uses that as the "start" character.

You are getting back only à because the UTF-8 encoded form of 'म' (0xE0 0xA4 0xAE) also starts with 0xE0, so PHP also thinks that the "end character" is 0xE0, or à. The range from 0xE0 to 0xE0 contains exactly one element.
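To check the byte-level claim yourself, here is a small sketch that inspects the two strings directly. The variable names are only illustrative, and the letters are written as explicit byte escapes so the result does not depend on how the file is saved.

```php
<?php
// UTF-8 bytes written out explicitly, so the file encoding does not matter.
$start = "\xE0\xA4\x95"; // क (U+0915)
$end   = "\xE0\xA4\xAE"; // म (U+092E)

echo bin2hex($start), "\n";      // e0a495
echo bin2hex($end), "\n";        // e0a4ae
echo strlen($start), "\n";       // 3 (strlen counts bytes, not characters)

// Both strings begin with the same byte, which is all range() looks at.
var_dump($start[0] === $end[0]); // bool(true)
```

Since both first bytes are 0xE0, a byte-based range between them can only ever produce that single byte.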

Solution

You can write range as a for loop yourself, as long as there is some function that returns the Unicode code point of a UTF-8 character (and one that does the reverse). So I googled and found these:

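// Returns the UTF-8 character with code point $intval
// (pack('n') holds 16 bits, so this covers code points up to U+FFFF only)
function unichr($intval) {
    return mb_convert_encoding(pack('n', $intval), 'UTF-8', 'UTF-16BE');
}

// Returns the code point for a UTF-8 character
// (UCS-2 is a 2-byte encoding, so this also covers code points up to U+FFFF only)
function uniord($u) {
    $k = mb_convert_encoding($u, 'UCS-2LE', 'UTF-8');
    $k1 = ord(substr($k, 0, 1)); // low byte (little-endian order)
    $k2 = ord(substr($k, 1, 1)); // high byte
    return $k2 * 256 + $k1;
}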

With the above, you can now write:

$alphabet = array();
for ($char = uniord('क'); $char <= uniord('म'); ++$char) {
    $alphabet[] = unichr($char);
}

print_r($alphabet);
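On newer PHP versions the same loop can probably be written without the two helper functions. The following is a minimal sketch, assuming PHP 7.2 or later with the mbstring extension, which provides mb_ord() and mb_chr(). The function name utf8_range() is hypothetical and not part of PHP.

```php
<?php
// Hypothetical helper: a code-point-aware counterpart to range().
// Assumes PHP 7.2+ with mbstring (for mb_ord() and mb_chr()).
function utf8_range(string $start, string $end): array
{
    $first = mb_ord($start, 'UTF-8'); // code point of the first character
    $last  = mb_ord($end, 'UTF-8');   // code point of the last character

    $chars = [];
    for ($cp = $first; $cp <= $last; ++$cp) {
        $chars[] = mb_chr($cp, 'UTF-8'); // code point back to a UTF-8 string
    }
    return $chars;
}

// "\u{0915}" is क and "\u{092E}" is म (escape syntax available since PHP 7.0).
$letters = utf8_range("\u{0915}", "\u{092E}");

echo count($letters), "\n";        // 26
echo implode(' ', $letters), "\n";
```

Like the for loop above, this walks every code point between the two ends, so it returns whatever the Unicode block contains in that span, whether or not you consider each entry part of the alphabet.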


Jon answered Nov 02 '22 13:11


The lazy solution would be to use html_entity_decode(), and to use range() only for the numeric ranges it was originally intended for (that it works with ASCII is a bit silly anyway):

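$i = 0;
$alphabets = array();
// range() gets plain integers here: the code points U+0915 (क) to U+092E (म)
foreach (range(0x0915, 0x092E) as $char) {
    // Decode the numeric character reference (e.g. "&#2325;") into a UTF-8 character
    $char = html_entity_decode("&#$char;", ENT_COMPAT, "UTF-8");
    $alphabets[$char] = ++$i;
}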
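To see why this works, it may help to look at a single step in isolation. This is only an illustrative sketch with made-up variable names: interpolating the integer into the string yields a decimal numeric character reference, which html_entity_decode() then turns into the UTF-8 bytes of the letter.

```php
<?php
$cp = 0x0915;        // code point of क, which is 2325 in decimal
$entity = "&#$cp;";  // integers interpolate as decimal, giving "&#2325;"

$char = html_entity_decode($entity, ENT_COMPAT, 'UTF-8');

echo $entity, "\n";        // &#2325;
echo bin2hex($char), "\n"; // e0a495
```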
mario answered Nov 02 '22 12:11