
Can php detect 4-byte encoded utf8 chars?

Tags: php, utf8mb4

I am using MySQL tables with the utf8 charset on a MySQL 5.1 server, which does not support the utf8mb4 encoding. When I insert 4-byte UTF-8 characters such as "𡃁","𨋢","𠵱","𥄫","𠽌","唧","𠱁", MySQL either raises an error or drops the text that follows the character.

How can I programmatically detect 4-byte encoded utf8 characters in PHP and replace them?

Asked by Abby Chau Yu Hoi, May 11 '13 11:05


2 Answers

The following regular expression will replace 4-byte UTF-8 characters:

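// Replace every 4-byte UTF-8 sequence (code points U+10000 to U+10FFFF) in $string.
// The pattern matches raw bytes, so the /u modifier is not needed.
function replace4byte($string, $replacement = '') {
    return preg_replace('%(?:
          \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
        | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
        | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
    )%xs', $replacement, $string);
}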

var_dump(replace4byte('d'), replace4byte('d𡃁d'));

This doesn't rely on the /u modifier, so it works even if PCRE was compiled without UTF-8 support. However, if you have that support, deceze's preg_replace_callback is neater.
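Since the question also asks about detection, the same byte-level pattern can be used with preg_match. The following is only a sketch: has4byte is a hypothetical helper name, not part of the answer above, and it assumes the input is valid UTF-8.

```php
<?php
// Hypothetical helper (not from the original answer): returns true if
// $string contains at least one 4-byte UTF-8 sequence.
// Matches raw bytes, so the /u modifier is not required.
function has4byte(string $string): bool {
    return preg_match('%(?:
          \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
        | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
        | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
    )%xs', $string) === 1;
}

var_dump(has4byte('d'));             // ASCII only
var_dump(has4byte("d\u{210C1}d"));   // contains U+210C1, a 4-byte character
var_dump(has4byte("\u{5527}"));      // U+5527 needs only 3 bytes
```

A check like this could run before an INSERT to decide whether a string needs cleaning at all.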

(Regex adapted from Ensuring valid utf-8 in PHP)

Answered by cmbuckley, Oct 10 '22 22:10


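This should detect them:

// The empty-string guard avoids calling max() on an empty array (PHP 8.2+).
if ($string !== '' && max(array_map('ord', str_split($string))) >= 240) {
    // $string contains at least one 4-byte sequence
}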

The rationale is that code points up to and including U+FFFF are encoded in at most three bytes, the longest form being 1110xxxx 10xxxxxx 10xxxxxx. Higher code points are of the form 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx, i.e. the leading byte has a value of 240 (0xF0) or higher. In valid UTF-8, any such byte in the string indicates a 4-byte sequence.
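To make the bit patterns concrete, here is a small sketch that classifies a leading byte by its high bits and prints the bytes of a few sample characters. utf8LengthFromLeadByte is a hypothetical helper introduced for illustration; it inspects the bit pattern only and does not validate the sequence.

```php
<?php
// Hypothetical helper: sequence length implied by a leading byte's bit pattern.
// Returns 0 for continuation bytes (10xxxxxx) and other non-lead values.
function utf8LengthFromLeadByte(int $byte): int {
    if ($byte < 0x80) { return 1; }                    // 0xxxxxxx
    if ($byte >= 0xC0 && $byte < 0xE0) { return 2; }   // 110xxxxx
    if ($byte >= 0xE0 && $byte < 0xF0) { return 3; }   // 1110xxxx
    if ($byte >= 0xF0 && $byte < 0xF8) { return 4; }   // 11110xxx
    return 0;
}

// Sample characters: U+0041, U+00E9, U+5527, U+210C1.
foreach (["A", "\u{E9}", "\u{5527}", "\u{210C1}"] as $char) {
    $lead = ord($char[0]);
    printf(
        "bytes %-8s  lead byte %3d (0x%02X)  length %d\n",
        bin2hex($char), $lead, $lead, utf8LengthFromLeadByte($lead)
    );
}
```

Only the last sample has a leading byte of 240 or more, which is what the max() check relies on.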

If you want to remove long characters, this will do:

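// Requires PCRE with UTF-8 support (the /u modifier).
$string = preg_replace_callback('/./u', function (array $match) {
    // Drop any character whose UTF-8 encoding is 4 bytes long.
    return strlen($match[0]) >= 4 ? '' : $match[0];
}, $string);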

Though there may be a more elegant regex way to express high codepoints directly.
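One way to express the high code points directly, assuming PCRE was built with UTF-8 support, is a character class over U+10000 to U+10FFFF. This is a sketch, and stripAstral is a hypothetical name rather than anything from the answers here.

```php
<?php
// Hypothetical helper: replace every code point above U+FFFF.
// Assumes the /u modifier is available. preg_replace returns null on
// failure, hence the nullable return type.
function stripAstral(string $string, string $replacement = ''): ?string {
    return preg_replace('/[\x{10000}-\x{10FFFF}]/u', $replacement, $string);
}

var_dump(stripAstral("d\u{210C1}d"));            // 4-byte character removed
var_dump(stripAstral("d\u{210C1}d", "\u{FFFD}")); // replaced with U+FFFD (3 bytes)
```

Unlike the byte-level pattern, this version expects the subject to be valid UTF-8, so callers should be prepared to handle a null result.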

Answered by deceze, Oct 10 '22 22:10