Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How can I detect a malformed UTF-8 string in PHP?

The iconv function sometimes gives me an error:

Notice: iconv() [function.iconv]: Detected an incomplete multibyte character in input string in [...] 

Is there a way to detect that there are illegal characters in a UTF-8 string before sending data to inconv()?

like image 787
rsk82 Avatar asked Jul 17 '11 11:07

rsk82


People also ask

Does PHP support UTF-8?

PHP | utf8_encode() FunctionThe utf8_encode() function is an inbuilt function in PHP which is used to encode an ISO-8859-1 string to UTF-8. Unicode has been developed to describe all possible characters of all languages and includes a lot of symbols with one unique number for each symbol/character.

Does STD string support UTF-8?

UTF-8 actually works quite well in std::string . Most operations work out of the box because the UTF-8 encoding is self-synchronizing and backward compatible with ASCII.

What are UTF-8 strings?

UTF-8 encodes a character into a binary string of one, two, three, or four bytes. UTF-16 encodes a Unicode character into a string of either two or four bytes. This distinction is evident from their names. In UTF-8, the smallest binary representation of a character is one byte, or eight bits.


1 Answers

First, note that it is not possible to detect whether text belongs to a specific undesired encoding. You can only check whether a string is valid in a given encoding.

You can make use of the UTF-8 validity check that is available in preg_match [PHP Manual] since PHP 4.3.5. It will return 0 (with no additional information) if an invalid string is given:

$isUTF8 = preg_match('//u', $string); 

Another possibility is mb_check_encoding [PHP Manual]:

$validUTF8 = mb_check_encoding($string, 'UTF-8'); 

Another function you can use is mb_detect_encoding [PHP Manual]:

$validUTF8 = ! (false === mb_detect_encoding($string, 'UTF-8', true)); 

It's important to set the strict parameter to true.

Additionally, iconv [PHP Manual] allows you to change/drop invalid sequences on the fly. (However, if iconv encounters such a sequence, it generates a notification; this behavior cannot be changed.)

echo 'TRANSLIT : ', iconv("UTF-8", "ISO-8859-1//TRANSLIT", $string), PHP_EOL; echo 'IGNORE   : ', iconv("UTF-8", "ISO-8859-1//IGNORE", $string), PHP_EOL; 

You can use @ and check the length of the return string:

strlen($string) === strlen(@iconv('UTF-8', 'UTF-8//IGNORE', $string)); 

Check the examples on the iconv manual page as well.

like image 155
hakre Avatar answered Sep 23 '22 10:09

hakre