Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to replace/remove 4(+)-byte characters from a UTF-8 string in PHP?

Tags:

php

mysql

utf-8

It seems like MySQL does not support characters with more than 3 bytes in its default UTF-8 charset.

So, in PHP, how can I get rid of all 4(-and-more)-byte characters in a string and replace them with something like by some other character?

like image 957
Franz Avatar asked Dec 13 '11 15:12

Franz


People also ask

Can UTF-8 represent all characters?

Each UTF can represent any Unicode character that you need to represent. UTF-8 is based on 8-bit code units. Each character is encoded as 1 to 4 bytes. The first 128 Unicode code points are encoded as 1 byte in UTF-8.

What is the opposite of UTF-8?

These methods differ in the number of bytes they need to store a character. UTF-8 encodes a character into a binary string of one, two, three, or four bytes. UTF-16 encodes a Unicode character into a string of either two or four bytes. This distinction is evident from their names.

What is a UTF-8 multibyte character?

UTF-8 is a multibyte encoding able to encode the whole Unicode charset. An encoded character takes between 1 and 4 bytes. UTF-8 encoding supports longer byte sequences, up to 6 bytes, but the biggest code point of Unicode 6.0 (U+10FFFF) only takes 4 bytes.

How many possible UTF-8 characters are there?

UTF-8 is capable of encoding all 1,112,064 valid character code points in Unicode using one to four one-byte (8-bit) code units. Code points with lower numerical values, which tend to occur more frequently, are encoded using fewer bytes.


2 Answers

NOTE: you should not just strip, but replace with replacement character U+FFFD to avoid unicode attacks, mostly XSS:

http://unicode.org/reports/tr36/#Deletion_of_Noncharacters

preg_replace('/[\x{10000}-\x{10FFFF}]/u', "\xEF\xBF\xBD", $value); 
like image 84
glen Avatar answered Oct 14 '22 16:10

glen


Since 4-byte UTF-8 sequences always start with the bytes 0xF0-0xF7, the following should work:

$str = preg_replace('/[\xF0-\xF7].../s', '', $str); 

Alternatively, you could use preg_replace in UTF-8 mode but this will probably be slower:

$str = preg_replace('/[\x{10000}-\x{10FFFF}]/u', '', $str); 

This works because 4-byte UTF-8 sequences are used for code points in the supplementary Unicode planes starting from 0x10000.

like image 21
nwellnhof Avatar answered Oct 14 '22 16:10

nwellnhof