Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

UTF-8 safe equivalent of ord or charCodeAt() in PHP

I need to be able to use ord() to get the same value as javascript's charCodeAt() function. The problem is that ord() doesn't support UTF8.

How can I get Ą to translate to 260 in PHP? I've tried some uniord functions out there, but they all report 256 instead of 260.

Thanks a lot for any help!

Regards

like image 355
Rila Avatar asked Apr 26 '12 12:04

Rila


People also ask

Does PHP support utf8?

PHP | utf8_encode() FunctionThe utf8_encode() function is an inbuilt function in PHP which is used to encode an ISO-8859-1 string to UTF-8. Unicode has been developed to describe all possible characters of all languages and includes a lot of symbols with one unique number for each symbol/character.

What is charCodeAt Javascript?

The charCodeAt() method returns the Unicode of the character at a specified index (position) in a string. The index of the first character is 0, the second is 1, .... The index of the last character is string length - 1 (See Examples below). See also the charAt() method.


2 Answers

mbstring version:

function utf8_char_code_at($str, $index)
{
    $char = mb_substr($str, $index, 1, 'UTF-8');

    if (mb_check_encoding($char, 'UTF-8')) {
        $ret = mb_convert_encoding($char, 'UTF-32BE', 'UTF-8');
        return hexdec(bin2hex($ret));
    } else {
        return null;
    }
}

using htmlspecialchars and htmlspecialchars_decode for getting one character:

function utf8_char_code_at($str, $index)
{
    $char = '';
    $str_index = 0;

    $str = utf8_scrub($str);
    $len = strlen($str);

    for ($i = 0; $i < $len; $i += 1) {

        $char .= $str[$i];

        if (utf8_check_encoding($char)) {

            if ($str_index === $index) {
                return utf8_ord($char);
            }

            $char = '';
            $str_index += 1;
        }
    }

    return null;
}

function utf8_scrub($str)
{
    return htmlspecialchars_decode(htmlspecialchars($str, ENT_SUBSTITUTE, 'UTF-8'));
}

function utf8_check_encoding($str)
{
    return $str === utf8_scrub($str);
}

function utf8_ord($char)
{
    $lead = ord($char[0]);

    if ($lead < 0x80) {
        return $lead;
    } else if ($lead < 0xE0) {
        return (($lead & 0x1F) << 6) 
      | (ord($char[1]) & 0x3F);
    } else if ($lead < 0xF0) {
        return (($lead &  0xF) << 12)
     | ((ord($char[1]) & 0x3F) <<  6)
     |  (ord($char[2]) & 0x3F);
    } else {
        return (($lead &  0x7) << 18)
     | ((ord($char[1]) & 0x3F) << 12)
     | ((ord($char[2]) & 0x3F) <<  6)
     |  (ord($char[3]) & 0x3F);
    }
}

PHP extension version:

#include "ext/standard/html.h"
#include "ext/standard/php_smart_str.h"

const zend_function_entry utf8_string_functions[] = {
    PHP_FE(utf8_char_code_at, NULL)
    PHP_FE_END
};

PHP_FUNCTION(utf8_char_code_at)
{
    char *str;
    int len;
    long index;

    unsigned int code_point;
    long i;
    int status;
    size_t pos = 0, old_pos = 0;

    if (zend_parse_parameters(ZEND_NUM_ARGS() TSRMLS_CC, "sl", &str, &len, &index) == FAILURE) {
        return;
    }

    for (i = 0; pos < len; ++i) {
        old_pos = pos;
        code_point = php_next_utf8_char((const unsigned char *) str, (size_t) len, &pos, &status);

        if (i == index) {
            if (status == SUCCESS) {
                RETURN_LONG(code_point);
            } else {
                RETURN_NULL();
            }

        }

    }

    RETURN_NULL();
}
like image 132
masakielastic Avatar answered Oct 15 '22 20:10

masakielastic


ord() works byte per byte (as most of PHPs standard string functions - if not all). You would need to convert it your own, for example with the help of the multibyte string extension:

$utf8Character = 'Ą';
list(, $ord) = unpack('N', mb_convert_encoding($utf8Character, 'UCS-4BE', 'UTF-8'));
echo $ord; # 260
like image 37
hakre Avatar answered Oct 15 '22 20:10

hakre