Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Unicode character in PHP string

Tags:

php

unicode

People also ask

How to generate Unicode characters in PHP?

string str = "\u1000"; This sample creates a string with a single Unicode character whose "Unicode numeric value" is 1000 in hexadecimal ( 4096 in decimal).

What is Unicode in PHP?

Definition and Usage. The utf8_encode() function encodes an ISO-8859-1 string to UTF-8. Unicode is a universal standard, and has been developed to describe all possible characters of all languages plus a lot of symbols with one unique number for each character/symbol.

How to use UTF-8 encoding in PHP?

PHP UTF-8 Encoding – modifications to your php. The first thing you need to do is to modify your php. ini file to use UTF-8 as the default character set: default_charset = "utf-8"; (Note: You can subsequently use phpinfo() to verify that this has been set properly.)

Does PHP use Unicode?

PHP's lack of Unicode/multibyte support means that the standard string handling functions treat strings as a sequence of single-byte characters.


PHP 7.0.0 has introduced the "Unicode codepoint escape" syntax.

It's now possible to write Unicode characters easily by using a double-quoted or a heredoc string, without calling any function.

$unicodeChar = "\u{1000}";

Because JSON directly supports the \uxxxx syntax the first thing that comes into my mind is:

$unicodeChar = '\u1000';
echo json_decode('"'.$unicodeChar.'"');

Another option would be to use mb_convert_encoding()

echo mb_convert_encoding('က', 'UTF-8', 'HTML-ENTITIES');

or make use of the direct mapping between UTF-16BE (big endian) and the Unicode codepoint:

echo mb_convert_encoding("\x10\x00", 'UTF-8', 'UTF-16BE');

I wonder why no one has mentioned this yet, but you can do an almost equivalent version using escape sequences in double quoted strings:

\x[0-9A-Fa-f]{1,2}

The sequence of characters matching the regular expression is a character in hexadecimal notation.

ASCII example:

<?php
    echo("\x48\x65\x6C\x6C\x6F\x20\x57\x6F\x72\x6C\x64\x21");
?>

Hello World!

So for your case, all you need to do is $str = "\x30\xA2";. But these are bytes, not characters. The byte representation of the Unicode codepoint coincides with UTF-16 big endian, so we could print it out directly as such:

<?php
    header('content-type:text/html;charset=utf-16be');
    echo("\x30\xA2");
?>

If you are using a different encoding, you'll need alter the bytes accordingly (mostly done with a library, though possible by hand too).

UTF-16 little endian example:

<?php
    header('content-type:text/html;charset=utf-16le');
    echo("\xA2\x30");
?>

UTF-8 example:

<?php
    header('content-type:text/html;charset=utf-8');
    echo("\xE3\x82\xA2");
?>

There is also the pack function, but you can expect it to be slow.


PHP does not know these Unicode escape sequences. But as unknown escape sequences remain unaffected, you can write your own function that converts such Unicode escape sequences:

function unicodeString($str, $encoding=null) {
    if (is_null($encoding)) $encoding = ini_get('mbstring.internal_encoding');
    return preg_replace_callback('/\\\\u([0-9a-fA-F]{4})/u', create_function('$match', 'return mb_convert_encoding(pack("H*", $match[1]), '.var_export($encoding, true).', "UTF-16BE");'), $str);
}

Or with an anonymous function expression instead of create_function:

function unicodeString($str, $encoding=null) {
    if (is_null($encoding)) $encoding = ini_get('mbstring.internal_encoding');
    return preg_replace_callback('/\\\\u([0-9a-fA-F]{4})/u', function($match) use ($encoding) {
        return mb_convert_encoding(pack('H*', $match[1]), $encoding, 'UTF-16BE');
    }, $str);
}

Its usage:

$str = unicodeString("\u1000");