It seems like MySQL does not support characters with more than 3 bytes in its default UTF-8 charset.
So, in PHP, how can I get rid of all 4(-and-more)-byte characters in a string and replace them with something like by some other character?
Each UTF can represent any Unicode character that you need to represent. UTF-8 is based on 8-bit code units. Each character is encoded as 1 to 4 bytes. The first 128 Unicode code points are encoded as 1 byte in UTF-8.
These methods differ in the number of bytes they need to store a character. UTF-8 encodes a character into a binary string of one, two, three, or four bytes. UTF-16 encodes a Unicode character into a string of either two or four bytes. This distinction is evident from their names.
UTF-8 is a multibyte encoding able to encode the whole Unicode charset. An encoded character takes between 1 and 4 bytes. UTF-8 encoding supports longer byte sequences, up to 6 bytes, but the biggest code point of Unicode 6.0 (U+10FFFF) only takes 4 bytes.
UTF-8 is capable of encoding all 1,112,064 valid character code points in Unicode using one to four one-byte (8-bit) code units. Code points with lower numerical values, which tend to occur more frequently, are encoded using fewer bytes.
NOTE: you should not just strip, but replace with replacement character U+FFFD to avoid unicode attacks, mostly XSS:
http://unicode.org/reports/tr36/#Deletion_of_Noncharacters
preg_replace('/[\x{10000}-\x{10FFFF}]/u', "\xEF\xBF\xBD", $value);
Since 4-byte UTF-8 sequences always start with the bytes 0xF0-0xF7
, the following should work:
$str = preg_replace('/[\xF0-\xF7].../s', '', $str);
Alternatively, you could use preg_replace
in UTF-8 mode but this will probably be slower:
$str = preg_replace('/[\x{10000}-\x{10FFFF}]/u', '', $str);
This works because 4-byte UTF-8 sequences are used for code points in the supplementary Unicode planes starting from 0x10000
.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With