How do I replace invalid characters in a UTF-8 string with white space characters (using regex in PHP 5)?
Go to File > Reopen with Encoding > UTF-8. Copy the entire content of the file into a new file and save it. May not be the expected solution but putting this out here in case it helps anyone, since I've been struggling for hours with this.
Using the str_ireplace() method: str_ireplace() (the case-insensitive version of str_replace()) can remove special characters from the given string str by replacing each listed character with a white space (" ").
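As a rough illustration of the str_ireplace() approach, here is a minimal sketch. The $special list and the sample string are made up for the example. str_ireplace() only matches the literal substrings it is given, so on its own it does not detect invalid UTF-8 byte sequences.

```php
<?php
// Hypothetical list of characters to treat as "special"; adjust as needed.
$special = array('!', '@', '#', '$', '%');

// Made-up sample input.
$str = 'he!lo@wor#ld';

// Every occurrence of a listed character is replaced by a single space.
$clean = str_ireplace($special, ' ', $str);

echo $clean, "\n";
```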
To automatically find and delete non-UTF-8 characters, we're going to use the iconv command. It is used in Linux systems to convert text from one character encoding to another.
Use iconv. Note that //IGNORE drops invalid bytes rather than replacing them with white space:
$text = iconv("UTF-8", "UTF-8//IGNORE", $text);
See the manual.
Cheers
With mbstring you can do:
$text = mb_convert_encoding($text, 'UTF-8', 'UTF-8');
This will work as you want (invalid characters are replaced by white space), but it doesn't seem to work if you want to substitute invalid characters with something else, like ?.
See: Replacing invalid UTF-8 characters by question marks, mbstring.substitute_character seems ignored
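The character that mbstring writes in place of invalid input is controlled by mb_substitute_character(), so one way to ask for a space explicitly might look like the sketch below. The helper name replace_invalid_utf8_mb and the sample input are hypothetical, and behaviour may differ between PHP versions (as the linked question suggests), so treat this as a starting point rather than a guaranteed fix.

```php
<?php
// Hypothetical helper: re-encodes UTF-8 to UTF-8 so that mbstring swaps
// invalid bytes for its substitute character (a space, U+0020, by default here).
function replace_invalid_utf8_mb($text, $substitute = 0x20)
{
    $previous = mb_substitute_character();  // remember the current setting
    mb_substitute_character($substitute);   // code point used for invalid input
    $clean = mb_convert_encoding($text, 'UTF-8', 'UTF-8');
    mb_substitute_character($previous);     // restore the previous setting
    return $clean;
}

// The byte \xFF can never appear in valid UTF-8.
$clean = replace_invalid_utf8_mb("a\xFFb");

// Check that the result is now valid UTF-8.
var_dump(mb_check_encoding($clean, 'UTF-8'));
```

Saving and restoring the previous substitute character keeps the change from leaking into other mbstring calls in the same request.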
iconv was not working in my case (nor were the other solutions), so I found mine here, in the "Character validation" section:
http://webcollab.sourceforge.net/unicode.html
If you have come across the cursed ‘Invalid Character’ error while using PHP’s XML or JSON parser, then you may be interested in this.
Unfortunately, PHP’s XML and JSON parsers do not ignore non-UTF-8 characters; instead they stop and throw a rather unhelpful error. I found the code below on the net and it works excellently for me.
//reject overly long 2 byte sequences, as well as characters above U+10000 and replace with ?
$some_string = preg_replace('/[\x00-\x08\x10\x0B\x0C\x0E-\x19\x7F]'.
'|[\x00-\x7F][\x80-\xBF]+'.
'|([\xC0\xC1]|[\xF0-\xFF])[\x80-\xBF]*'.
'|[\xC2-\xDF]((?![\x80-\xBF])|[\x80-\xBF]{2,})'.
'|[\xE0-\xEF](([\x80-\xBF](?![\x80-\xBF]))|(?![\x80-\xBF]{2})|[\x80-\xBF]{3,})/S',
'?', $some_string );
//reject overly long 3 byte sequences and UTF-16 surrogates and replace with ?
$some_string = preg_replace('/\xE0[\x80-\x9F][\x80-\xBF]'.
'|\xED[\xA0-\xBF][\x80-\xBF]/S','?', $some_string );
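Since the original question asks for white space rather than ?, the two preg_replace() passes quoted above could be wrapped in a small function with a configurable replacement. This is only a sketch: the function name replace_invalid_utf8_regex and the sample inputs are hypothetical, and the patterns are copied unchanged from the snippet above, quirks included.

```php
<?php
// Hypothetical wrapper around the two preg_replace() passes quoted above.
// $replacement should not contain "$" or "\", which preg_replace() would
// interpret as back-references.
function replace_invalid_utf8_regex($text, $replacement = ' ')
{
    // Pass 1: some control characters; ASCII bytes followed by stray
    // continuation bytes (the ASCII byte is replaced too); the lead bytes
    // \xC0, \xC1 and \xF0-\xFF with any continuation bytes after them;
    // 2- and 3-byte lead bytes with too few or too many continuation bytes.
    $text = preg_replace('/[\x00-\x08\x10\x0B\x0C\x0E-\x19\x7F]'.
        '|[\x00-\x7F][\x80-\xBF]+'.
        '|([\xC0\xC1]|[\xF0-\xFF])[\x80-\xBF]*'.
        '|[\xC2-\xDF]((?![\x80-\xBF])|[\x80-\xBF]{2,})'.
        '|[\xE0-\xEF](([\x80-\xBF](?![\x80-\xBF]))|(?![\x80-\xBF]{2})|[\x80-\xBF]{3,})/S',
        $replacement, $text);

    // Pass 2: overlong 3-byte sequences and UTF-16 surrogates.
    $text = preg_replace('/\xE0[\x80-\x9F][\x80-\xBF]'.
        '|\xED[\xA0-\xBF][\x80-\xBF]/S', $replacement, $text);

    return $text;
}

$dirty = "a\xFFb";         // \xFF is never valid in UTF-8
$valid = "h\xC3\xA9llo";   // "héllo" written as raw bytes

$cleaned   = replace_invalid_utf8_regex($dirty);
$untouched = replace_invalid_utf8_regex($valid);

// After cleaning, json_encode() no longer fails on the string.
var_dump(json_encode($cleaned) !== false);
```

Because the second alternative of the first pattern also consumes the ASCII byte in front of a stray continuation byte, some valid characters next to the damage can be lost; check the output on your own data before relying on it.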