Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to get the character from unicode code point in PHP?

For example,

how to get the character corresponding to U+010F?

like image 286
omg Avatar asked Sep 02 '09 02:09

omg


People also ask

How to get Unicode of character in PHP?

In PHP, we can use the mb_ord() function to get the Unicode code point value of a given character. This function is supported in PHP 7 or higher versions. The mb_ord() function complements the mc_chr() function.

How to print Unicode characters in PHP?

The best way is to tell the browser that UTF-8 is being used by sending the corresponding HTTP header: header("content-type: text/html; charset=UTF-8"); Then, you can leave the rest of your code as-is and don't have to html-encode entities or create other mess.

How do I find Unicode value of a character?

We can determine the unicode category for a particular character by using the getType() method. It is a static method of Character class and it returns an integer value of char ch representing in unicode general category.

Does PHP use Unicode?

PHP does not offer native Unicode support. PHP only supports a 256-character set. However, PHP provides the UTF-8 functions utf8_encode() and utf8_decode() to provide some basic Unicode functionality. See the PHP manual for strings for more details about PHP and Unicode.


2 Answers

header('Content-Encoding: UTF-8');

function mb_html_entity_decode($string)
{
    if (extension_loaded('mbstring') === true)
    {
        mb_language('Neutral');
        mb_internal_encoding('UTF-8');
        mb_detect_order(array('UTF-8', 'ISO-8859-15', 'ISO-8859-1', 'ASCII'));

        return mb_convert_encoding($string, 'UTF-8', 'HTML-ENTITIES');
    }

    return html_entity_decode($string, ENT_COMPAT, 'UTF-8');
}

function mb_ord($string)
{
    if (extension_loaded('mbstring') === true)
    {
        mb_language('Neutral');
        mb_internal_encoding('UTF-8');
        mb_detect_order(array('UTF-8', 'ISO-8859-15', 'ISO-8859-1', 'ASCII'));

        $result = unpack('N', mb_convert_encoding($string, 'UCS-4BE', 'UTF-8'));

        if (is_array($result) === true)
        {
            return $result[1];
        }
    }

    return ord($string);
}

function mb_chr($string)
{
    return mb_html_entity_decode('&#' . intval($string) . ';');
}

var_dump(hexdec('010F'));

var_dump(mb_ord('ó')); // 243
var_dump(mb_chr(243)); // ó
like image 102
Alix Axel Avatar answered Oct 26 '22 12:10

Alix Axel


I just wrote a polyfill for missing multibyte versions of ord and chr with the following in mind:

  • It defines functions mb_ord and mb_chr only if they don't already exist. If they do exist in your framework or some future version of PHP, the polyfill will be ignored.

  • It uses the widely used mbstring extension to do the conversion. If the mbstring extension is not loaded, it will use the iconv extension instead.


EDIT :

I added functions for HTMLentities encoding / decoding and encoding / decoding to JSON format as well as some demo code for how to use these functions


Code :

if (!function_exists('codepoint_encode')) {
    function codepoint_encode($str) {
        return substr(json_encode($str), 1, -1);
    }
}

if (!function_exists('codepoint_decode')) {
    function codepoint_decode($str) {
        return json_decode(sprintf('"%s"', $str));
    }
}

if (!function_exists('mb_internal_encoding')) {
    function mb_internal_encoding($encoding = NULL) {
        return ($from_encoding === NULL) ? iconv_get_encoding() : iconv_set_encoding($encoding);
    }
}

if (!function_exists('mb_convert_encoding')) {
    function mb_convert_encoding($str, $to_encoding, $from_encoding = NULL) {
        return iconv(($from_encoding === NULL) ? mb_internal_encoding() : $from_encoding, $to_encoding, $str);
    }
}

if (!function_exists('mb_chr')) {
    function mb_chr($ord, $encoding = 'UTF-8') {
        if ($encoding === 'UCS-4BE') {
            return pack("N", $ord);
        } else {
            return mb_convert_encoding(mb_chr($ord, 'UCS-4BE'), $encoding, 'UCS-4BE');
        }
    }
}

if (!function_exists('mb_ord')) {
    function mb_ord($char, $encoding = 'UTF-8') {
        if ($encoding === 'UCS-4BE') {
            list(, $ord) = (strlen($char) === 4) ? @unpack('N', $char) : @unpack('n', $char);
            return $ord;
        } else {
            return mb_ord(mb_convert_encoding($char, 'UCS-4BE', $encoding), 'UCS-4BE');
        }
    }
}

if (!function_exists('mb_htmlentities')) {
    function mb_htmlentities($string, $hex = true, $encoding = 'UTF-8') {
        return preg_replace_callback('/[\x{80}-\x{10FFFF}]/u', function ($match) use ($hex) {
            return sprintf($hex ? '&#x%X;' : '&#%d;', mb_ord($match[0]));
        }, $string);
    }
}

if (!function_exists('mb_html_entity_decode')) {
    function mb_html_entity_decode($string, $flags = null, $encoding = 'UTF-8') {
        return html_entity_decode($string, ($flags === NULL) ? ENT_COMPAT | ENT_HTML401 : $flags, $encoding);
    }
}

How to use :

echo "Get string from numeric DEC value\n";
var_dump(mb_chr(50319, 'UCS-4BE'));
var_dump(mb_chr(271));

echo "\nGet string from numeric HEX value\n";
var_dump(mb_chr(0xC48F, 'UCS-4BE'));
var_dump(mb_chr(0x010F));

echo "\nGet numeric value of character as DEC string\n";
var_dump(mb_ord('ď', 'UCS-4BE'));
var_dump(mb_ord('ď'));

echo "\nGet numeric value of character as HEX string\n";
var_dump(dechex(mb_ord('ď', 'UCS-4BE')));
var_dump(dechex(mb_ord('ď')));

echo "\nEncode / decode to DEC based HTML entities\n";
var_dump(mb_htmlentities('tchüß', false));
var_dump(mb_html_entity_decode('tchüß'));

echo "\nEncode / decode to HEX based HTML entities\n";
var_dump(mb_htmlentities('tchüß'));
var_dump(mb_html_entity_decode('tchüß'));

echo "\nUse JSON encoding / decoding\n";
var_dump(codepoint_encode("tchüß"));
var_dump(codepoint_decode('tch\u00fc\u00df'));

Output :

Get string from numeric DEC value
string(4) "ď"
string(2) "ď"

Get string from numeric HEX value
string(4) "ď"
string(2) "ď"

Get numeric value of character as DEC int
int(50319)
int(271)

Get numeric value of character as HEX string
string(4) "c48f"
string(3) "10f"

Encode / decode to DEC based HTML entities
string(15) "tchüß"
string(7) "tchüß"

Encode / decode to HEX based HTML entities
string(15) "tchüß"
string(7) "tchüß"

Use JSON encoding / decoding
string(15) "tch\u00fc\u00df"
string(7) "tchüß"
like image 32
John Slegers Avatar answered Oct 26 '22 14:10

John Slegers