In PHP, we can use mb_check_encoding() to determine whether a string is valid UTF-8. But that's not a portable solution, as it requires the mbstring extension to be compiled in and enabled. Additionally, it won't tell us which character is invalid.
Is there a regular expression (or another 100% portable method) that can match invalid UTF-8 bytes in a given string?
That way, those bytes can be replaced if needed while keeping the binary information (for example, when building a test output XML file that includes binary data). Simply converting the characters to UTF-8 would lose information. So, we may want to convert:
"foo" . chr(128) . chr(255)
Into
"foo<128><255>"
So just "detecting" that the string is invalid is not good enough; we'd need to be able to detect which bytes are invalid.
Non-UTF-8 data consists of bytes or byte sequences that are not well-formed UTF-8, typically because the text was written in another encoding or is binary data. PHP strings are plain byte arrays, so storing such bytes in a variable does not raise an error by itself; problems appear when the data is handed to something that expects valid UTF-8, such as an XML parser or json_encode().
Yes. 0xC0, 0xC1, 0xF5, 0xF6, 0xF7, 0xF8, 0xF9, 0xFA, 0xFB, 0xFC, 0xFD, 0xFE, 0xFF are invalid UTF-8 code units.
UTF-8 actually works quite well in std::string. Most operations work out of the box because the UTF-8 encoding is self-synchronizing and backward compatible with ASCII.
Each character is represented by one to four bytes. UTF-8 is backward-compatible with ASCII and can represent any standard Unicode character. The first 128 UTF-8 characters precisely match the first 128 ASCII characters (numbered 0-127), meaning that existing ASCII text is already valid UTF-8.
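As a rough illustration of the one-to-four-byte layout and of the bytes that can never appear, the first byte of a sequence determines how long the sequence should be. The function name utf8_sequence_length is hypothetical and not part of PHP; this is only a sketch of the lead-byte ranges, not a full validator.

```php
<?php
// Hypothetical helper: expected sequence length for a given lead byte,
// or 0 if the byte can never start a valid UTF-8 sequence.
function utf8_sequence_length($byte)
{
    if ($byte <= 0x7F) {
        return 1; // ASCII, identical in UTF-8
    }
    if ($byte >= 0xC2 && $byte <= 0xDF) {
        return 2;
    }
    if ($byte >= 0xE0 && $byte <= 0xEF) {
        return 3;
    }
    if ($byte >= 0xF0 && $byte <= 0xF4) {
        return 4;
    }
    // 0x80-0xBF are continuation bytes; 0xC0, 0xC1 and 0xF5-0xFF are never valid.
    return 0;
}

var_dump(utf8_sequence_length(0x41)); // int(1)
var_dump(utf8_sequence_length(0xC0)); // int(0)
```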
You can use this PCRE regular expression to check for invalid UTF-8 in a string. If the regex matches, the string contains invalid byte sequences. It's 100% portable because it doesn't rely on PCRE_UTF8 being compiled in.
$regex = '/(
    [\xC0-\xC1] # Invalid UTF-8 Bytes
    | [\xF5-\xFF] # Invalid UTF-8 Bytes
    | \xE0[\x80-\x9F] # Overlong encoding of prior code point
    | \xF0[\x80-\x8F] # Overlong encoding of prior code point
    | [\xC2-\xDF](?![\x80-\xBF]) # Invalid UTF-8 Sequence Start
    | [\xE0-\xEF](?![\x80-\xBF]{2}) # Invalid UTF-8 Sequence Start
    | [\xF0-\xF4](?![\x80-\xBF]{3}) # Invalid UTF-8 Sequence Start
    | (?<=[\x00-\x7F\xF5-\xFF])[\x80-\xBF] # Invalid UTF-8 Sequence Middle
    | (?<![\xC2-\xDF]|[\xE0-\xEF]|[\xE0-\xEF][\x80-\xBF]|[\xF0-\xF4]|[\xF0-\xF4][\x80-\xBF]|[\xF0-\xF4][\x80-\xBF]{2})[\x80-\xBF] # Overlong Sequence
    | (?<=[\xE0-\xEF])[\x80-\xBF](?![\x80-\xBF]) # Short 3 byte sequence
    | (?<=[\xF0-\xF4])[\x80-\xBF](?![\x80-\xBF]{2}) # Short 4 byte sequence
    | (?<=[\xF0-\xF4][\x80-\xBF])[\x80-\xBF](?![\x80-\xBF]) # Short 4 byte sequence (2)
)/x';
We can test it by creating a few variations of text:
// Overlong encoding of code point 0
$text = chr(0xC0) . chr(0x80);
var_dump(preg_match($regex, $text)); // int(1)

// Overlong encoding of 5 byte encoding
$text = chr(0xF8) . chr(0x80) . chr(0x80) . chr(0x80) . chr(0x80);
var_dump(preg_match($regex, $text)); // int(1)

// Overlong encoding of 6 byte encoding
$text = chr(0xFC) . chr(0x80) . chr(0x80) . chr(0x80) . chr(0x80) . chr(0x80);
var_dump(preg_match($regex, $text)); // int(1)

// High code-point without trailing characters
$text = chr(0xD0) . chr(0x01);
var_dump(preg_match($regex, $text)); // int(1)
etc...
In fact, since this matches invalid bytes, you could then use it in preg_replace to strip them out:
$clean = preg_replace($regex, '', $text); // Remove all invalid UTF-8 bytes
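For the conversion the question asks for ("foo<128><255>"), a plain byte loop is another portable option that needs no extension at all. The sketch below is not the regex from this answer but a swapped-in technique: it walks the string and checks each sequence against the well-formed byte ranges of the UTF-8 definition, which also rejects surrogates and code points above U+10FFFF. The function name escape_invalid_utf8 is hypothetical.

```php
<?php
// Hypothetical helper (not from the original answer): replaces every byte
// that is not part of a well-formed UTF-8 sequence with "<N>", where N is
// the decimal byte value. Valid sequences are copied through unchanged.
function escape_invalid_utf8($s)
{
    $out = '';
    $len = strlen($s);
    $i = 0;
    while ($i < $len) {
        $b = ord($s[$i]);
        // $n: expected sequence length; $lo/$hi: allowed range of the 2nd byte.
        $lo = 0x80;
        $hi = 0xBF;
        if ($b <= 0x7F) {
            $n = 1;
        } elseif ($b >= 0xC2 && $b <= 0xDF) {
            $n = 2;
        } elseif ($b == 0xE0) {
            $n = 3; $lo = 0xA0;           // excludes overlong 3-byte forms
        } elseif ($b == 0xED) {
            $n = 3; $hi = 0x9F;           // excludes UTF-16 surrogates
        } elseif ($b >= 0xE1 && $b <= 0xEF) {
            $n = 3;
        } elseif ($b == 0xF0) {
            $n = 4; $lo = 0x90;           // excludes overlong 4-byte forms
        } elseif ($b >= 0xF1 && $b <= 0xF3) {
            $n = 4;
        } elseif ($b == 0xF4) {
            $n = 4; $hi = 0x8F;           // nothing above U+10FFFF
        } else {
            $n = 0;                       // 0x80-0xBF, 0xC0, 0xC1, 0xF5-0xFF
        }

        // The sequence must fit in the string and every trailing byte
        // must fall in its allowed range.
        $ok = ($n > 0 && $i + $n <= $len);
        for ($k = 1; $ok && $k < $n; $k++) {
            $c = ord($s[$i + $k]);
            $min = ($k == 1) ? $lo : 0x80;
            $max = ($k == 1) ? $hi : 0xBF;
            $ok = ($c >= $min && $c <= $max);
        }

        if ($ok) {
            $out .= substr($s, $i, $n);
            $i += $n;
        } else {
            // Escape only the offending byte, then resynchronize on the next one.
            $out .= '<' . $b . '>';
            $i++;
        }
    }
    return $out;
}

echo escape_invalid_utf8("foo" . chr(128) . chr(255)), "\n"; // foo<128><255>
```

Escaping one byte at a time keeps the binary information intact, at the cost of the output being ambiguous if the input already contains literal text such as "<128>".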
Assuming PHP is compiled with PCRE, UTF-8 support is most often enabled as well. So, as explicitly asked for in the question, this very simple regular expression can detect invalid UTF-8 strings, because those won't match:
preg_match('//u', $string);
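A minimal sketch of how that check could be wrapped, assuming PCRE was built with UTF-8 support. The function name is_valid_utf8 is hypothetical. The strict comparison matters because preg_match() returns false rather than 0 when the subject is not valid UTF-8.

```php
<?php
// Hypothetical wrapper around the '//u' check.
function is_valid_utf8($string)
{
    // The empty pattern matches any subject, so a result of 1 means the
    // subject passed PCRE's UTF-8 validity check. On malformed input
    // preg_match() returns false and preg_last_error() reports
    // PREG_BAD_UTF8_ERROR.
    return preg_match('//u', $string) === 1;
}

var_dump(is_valid_utf8("foo"));                       // bool(true)
var_dump(is_valid_utf8("foo" . chr(128) . chr(255))); // bool(false)
```

Note that this only answers yes or no for the whole string; it does not say which bytes are invalid.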
You can then argue that the u modifier (PCRE_UTF8) is not always available, and true, this can happen, as a related question about what the u flag is dependent on shows. However, in my practical developer life, this has never been an issue. The bigger issue is the PCRE extension not being available at all, which would render any answer relying on PCRE useless (even mine here). But that has mostly been a problem of the past for some years now.
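If availability of the u modifier is a concern, it can be probed once at runtime. This is only a sketch, and the function name pcre_utf8_available is hypothetical; it assumes that a PCRE build without UTF-8 support makes preg_match() fail with a warning when the u modifier is used.

```php
<?php
// Hypothetical probe: true when patterns with the u modifier compile and
// treat a two-byte UTF-8 sequence as a single character.
function pcre_utf8_available()
{
    // "\xC3\xA9" is one character ("é") in UTF-8, so '^.$' only matches
    // when UTF-8 mode is really active. The @ suppresses the warning that
    // is emitted if the modifier is not supported.
    return @preg_match('/^.$/u', "\xC3\xA9") === 1;
}

var_dump(pcre_utf8_available());
```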
A lengthier answer similar to this one has been given in a near-duplicate question.
So I think this question should highlight more of the benefits the suggested answer ships with.