
JS charCodeAt equivalent in PHP (with full unicode and emoji compatibility)

I have a simple piece of JS code that I can't replicate in PHP when it comes to special characters.

This is the JS code (see JSFiddle for output):

var str = "t🙏🏿😘🎚↙️🕗🇨🇬芳"; //char "t" and special characters, emojis, etc..
document.write("Length is: "+str.length); // Length is: 19
for(var i=0; i<str.length; i++) {
  document.write("<br> charCodeAt(" + i + "): " + str.charCodeAt(i));
}

The first problem is that PHP's strlen() and mb_strlen() already give different results from JS (strlen: 39, mb_strlen: 11); however, I managed to get the same result as JS with a custom JS_StringLength function (thanks to this SO answer).

Here is what I have in PHP so far (see phpFiddle for output):

<?php

function JS_StringLength($string) {
    return strlen(iconv('UTF-8', 'UTF-16LE', $string)) / 2;
}

function JS_charCodeAt($str, $index){
    //not working!

    $char = mb_substr($str, $index, 1, 'UTF-8');
    if (mb_check_encoding($char, 'UTF-8'))
    {
        $ret = mb_convert_encoding($char, 'UTF-32BE', 'UTF-8');
        return hexdec(bin2hex($ret));
    } else {
        return null;
    }
}

$str = "t🙏🏿😘🎚↙️🕗🇨🇬芳";

echo $str."\n";
//echo "Length is: ".strlen($str)."\n"; //wrong
echo "Length is: ".JS_StringLength($str)."\n"; //OK
for($i=0; $i<JS_StringLength($str); $i++) {
    echo "charCodeAt(".$i."): ".JS_charCodeAt($str, $i)."\n";
}

After a full day of Googling, and trying out everything I found, nothing gave the same results as JS. What should JS_charCodeAt be to get the same output as JS with similar performance?

Experimenting #1:
Enter my string into https://r12a.github.io/app-conversion/ (awesome stuff). Looks like JS works with UTF-16 code units (19) and PHP strlen counts UTF-8 code units (39).
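
For reference, the three counts mentioned so far (19, 39 and 11) can be reproduced from the JS side. This is only a minimal sketch: the variable names are made up for illustration, the string is written with escapes so that it survives copy and paste, and it assumes an environment where TextEncoder is available (modern browsers and Node.js).

```javascript
// Same sample string as in the question, written with escapes.
const sample =
  "t\u{1F64F}\u{1F3FF}\u{1F618}\u{1F39A}\u2199\uFE0F\u{1F557}\u{1F1E8}\u{1F1EC}\u{2F994}";

// UTF-16 code units: what str.length and charCodeAt() work with.
const utf16Units = sample.length;

// UTF-8 bytes: what PHP's strlen() counts on a UTF-8 string.
const utf8Bytes = new TextEncoder().encode(sample).length;

// Code points: what mb_strlen() counts on a UTF-8 string.
const codePoints = Array.from(sample).length;

console.log(utf16Units, utf8Bytes, codePoints); // 19 39 11
```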

Experimenting #2:
When using json_encode() on my string, the result is, of course, something close to what JavaScript may use. I even examined the original PHP source code of json_encode and how json_encode escapes strings, but.. well..


Before flagging as a duplicate, please make sure you test a solution with the string in the above examples (or random emojis), as ALL the charCodeAt implementations found here on Stack Overflow work with most special characters, but NOT with emojis.

frzsombor asked Nov 28 '16 09:11


2 Answers

The way that JS handles UTF-16 is not ideal; charCodeAt picks out individual code units for you, which in the emoji cases means surrogates. If you want the real code point for each character, String.codePointAt() would be a better choice. That said, since your use case wasn't explained, this achieves what you were originally asking for without the need for JSON-related functions:
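
To make the code unit versus code point distinction concrete, here is a small JS sketch using one of the emoji from the question (😘, U+1F618). The variable names are only illustrative.

```javascript
const kiss = "\u{1F618}"; // 😘, a single code point outside the BMP

// charCodeAt() returns individual UTF-16 code units, here a surrogate pair.
const surrogates = [kiss.charCodeAt(0), kiss.charCodeAt(1)];

// codePointAt() combines the pair into the actual code point.
const realCodePoint = kiss.codePointAt(0);

console.log(surrogates);    // [ 55357, 56856 ], i.e. 0xD83D and 0xDE18
console.log(realCodePoint); // 128536, i.e. 0x1F618
```

The PHP code in this answer reproduces the first behaviour, the surrogate values, because that is what the question asked for.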

<?php

$original = 't🙏🏿😘🎚↙️🕗🇨🇬芳';
$converted = iconv('UTF-8', 'UTF-16LE', $original);

for ($i = 0; $i < iconv_strlen($converted, 'UTF-16LE'); $i++) {
    $character = iconv_substr($converted, $i, 1, 'UTF-16LE');
    $codeUnits = unpack('v*', $character);

    foreach ($codeUnits as $codeUnit) {
        echo $codeUnit . PHP_EOL;
    }
}

This converts the (assumed) UTF-8 string into UTF-16LE, then loops over each character. In UTF-16, each character is 2 or 4 bytes in size. unpack() with the repeating v formatter returns one short in the former case, or two in the latter (v is the unsigned 16-bit little-endian formatter).

It could also be implemented by looping over the UTF-8 and converting each character one-by-one; it doesn't make a great deal of difference though. Also the same could be achieved with the mb_* functions.
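
As a rough illustration of the character-by-character route, the sketch below splits each code point into its UTF-16 code unit(s) using the standard surrogate arithmetic. It is written in JS so the result can be compared directly against charCodeAt(); the helper names toUtf16CodeUnits and codeUnitsOf are hypothetical, and a PHP version would need to obtain each code point first (for example with the mb_* functions).

```javascript
// Hypothetical helper: split one code point into its UTF-16 code unit(s).
function toUtf16CodeUnits(codePoint) {
  if (codePoint < 0x10000) {
    return [codePoint]; // BMP character: one code unit with the same value
  }
  const offset = codePoint - 0x10000; // 20 bits remain
  return [
    0xD800 + (offset >> 10),  // high surrogate: top 10 bits
    0xDC00 + (offset & 0x3FF) // low surrogate: bottom 10 bits
  ];
}

// Hypothetical helper: list the code units of a whole string.
function codeUnitsOf(str) {
  const out = [];
  for (const ch of str) { // for...of iterates by code point
    out.push(...toUtf16CodeUnits(ch.codePointAt(0)));
  }
  return out;
}

console.log(codeUnitsOf("t\u{1F618}")); // [ 116, 55357, 56856 ]
```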


Edit

Since you've inquired about a quicker way of doing this, combining the above with the solution offered by nwellnhof gives better performance:

<?php

$original = 't🙏🏿😘🎚↙️🕗🇨🇬芳';
$converted = iconv('UTF-8', 'UTF-16LE', $original);

for ($i = 0; $i < strlen($converted); $i += 2) {
        $codeUnit = ord($converted[$i]) + (ord($converted[$i+1]) << 8);
        echo $codeUnit . PHP_EOL;
}

First off, this converts the UTF-8 string into UTF-16LE. We're interested in writing out UTF-16 code units (as per the behaviour of charCodeAt()), and each of these is represented by 16 bits. The loop simply jumps 2 bytes at a time. On each iteration, it takes the numeric value of the byte at that position and adds to it the value of the next byte, left shifted by 8. The second byte is the one shifted because we're dealing with little-endian UTF-16, where the low byte comes first.

By way of example, consider the character BENGALI DIGIT ONE (১). This is represented by a single UTF-16 code unit, 2535. It is easier to first describe how this is encoded as UTF-16BE. The single code unit for this character would consume 16 bits:

0000100111100111 (2535)

In PHP, strings are effectively byte arrays. So, PHP sees this as:

$converted[0] = 00001001 (9)
$converted[1] = 11100111 (231)

Given the 2 above bytes, how do we obtain the code unit? What we really want to do is something like:

   0000100100000000 (2304)
+          11100111 (231)
=  0000100111100111 (2535)

But we can't do that, since we only have single bytes to play with. One way to deal with this is to use integers instead, giving us a full 64 bits (8 bytes) on a 64-bit build of PHP; we want to represent the code unit in integer form anyway, so that seems like a reasonable route. We can obtain the numeric value of each byte via ord():

ord($converted[0]) == 0000000000000000000000000000000000000000000000000000000000001001 == 9
ord($converted[1]) == 0000000000000000000000000000000000000000000000000000000011100111 == 231

And left shift the first value by 8:

   0000000000000000000000000000000000000000000000000000000000001001 (9) 
<< 0000000000000000000000000000000000000000000000000000000000001000 (8)
=  0000000000000000000000000000000000000000000000000000100100000000 (2304)

And then sum together, as before:

   0000000000000000000000000000000000000000000000000000100100000000 (2304)
+  0000000000000000000000000000000000000000000000000000000011100111 (231)
=  0000000000000000000000000000000000000000000000000000100111100111 (2535)

So we now have the correct code unit value of 2535. The only difference with UTF-16LE is the order of the bytes is reversed. So instead of left shifting the first byte by 8, we need to left shift the second byte.
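
The same arithmetic can be checked from the JS side. This is only a sketch with made-up variable names: it combines the two bytes of BENGALI DIGIT ONE in both byte orders and compares the result with DataView and with charCodeAt().

```javascript
// The two bytes of U+09E7 (BENGALI DIGIT ONE) in each byte order.
const bigEndianBytes = [0x09, 0xE7];
const littleEndianBytes = [0xE7, 0x09];

// Big endian: the first byte is the high byte, so it is the one shifted.
const fromBE = (bigEndianBytes[0] << 8) + bigEndianBytes[1];

// Little endian: the second byte is the high byte.
const fromLE = littleEndianBytes[0] + (littleEndianBytes[1] << 8);

// DataView performs the same combination; `true` selects little endian.
const view = new DataView(Uint8Array.from(littleEndianBytes).buffer);
const viaDataView = view.getUint16(0, true);

console.log(fromBE, fromLE, viaDataView, "\u09E7".charCodeAt(0)); // 2535 2535 2535 2535
```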

P.S: An equivalent way of performing this step would be to do

for ($i = 0; $i < strlen($converted); $i += 2) {
        // unpack() returns an array indexed from 1, so take element 1
        $codeUnit = unpack('v', $converted[$i] . $converted[$i+1])[1];
        echo $codeUnit . PHP_EOL;
}

The unpack function will do exactly as just described when the v formatter is supplied, which tells it to expect 16 bits arranged in little endian. It may be worth benchmarking the two if you're interested in optimising for speed.

nj_ answered Sep 22 '22 15:09


[UPDATE: See a better solution in the accepted answer]

Ok, so after almost two days, I think I've found an answer myself. The basic idea is that json_encode() escapes multibyte Unicode characters in the form that JS uses for them (like 😘 = "\ud83d\ude18") for character counting, for the charCodeAt function, etc. So if we JSON-encode the string, we can extract an array of simple characters and escaped multibyte chars. This way, we can easily count the characters of the original string as UTF-16 code units (just like JS does). And of course, we can return the "charCodeAt" values (ord() on simple characters, and converting the \uXXXX hex to decimal on multibyte characters).

Problem: If I want to get the "JS charCodeAt" value for every character in a for loop (so basically convert a string to charcode list), this code will be slow on long texts, because preg_match_all in getUTF16CodeUnits will run once for every single character.
Workaround: Instead of calling getUTF16CodeUnits every time, store the matches array in a variable, and work with that. More details: FASTER VERSION (backup)

Code and demo:

<?php

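function getUTF16CodeUnits($string) {
    // Caveat: json_encode() also escapes ", \, / and control characters with
    // a backslash (e.g. \" or \n), and each of those is matched here as two
    // separate units, so such characters are counted and decoded incorrectly.
    $string = substr(json_encode($string), 1, -1);
    preg_match_all("/\\\\u[0-9a-fA-F]{4}|./mi", $string, $matches);
    return $matches[0];
}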

function JS_StringLength($string) {
    return count(getUTF16CodeUnits($string));
}

function JS_charCodeAt($string, $index) {
    $utf16CodeUnits = getUTF16CodeUnits($string);
    $unit = $utf16CodeUnits[$index];
    
    if(strlen($unit) > 1) {
        $hex = substr($unit, 2);
        return hexdec($hex);
    }
    else {
        return ord($unit);
    }
}

$str = "t🙏🏿😘🎚↙️🕗🇨🇬芳";

echo "Length is: ".JS_StringLength($str)."\n";
for($i=0; $i<JS_StringLength($str); $i++) {
    echo "charCodeAt(".$i."): ".JS_charCodeAt($str, $i)."\n";
}

Improvements, fixes, comments are highly appreciated!

frzsombor answered Sep 24 '22 15:09