
PHP: replace invalid characters in utf-8 string in

Tags:

regex

php

utf-8

How can I replace invalid characters in a UTF-8 string with whitespace characters, using a regex in PHP 5?

AexChecker asked Sep 16 '09 15:09



4 Answers

Use iconv:

$text = iconv("UTF-8", "UTF-8//IGNORE", $text);

Note that //IGNORE drops invalid byte sequences rather than replacing them with whitespace. See the manual.

Cheers

RageZ answered Sep 22 '22 10:09


With mbstring you can do:

$text = mb_convert_encoding($text, 'UTF-8', 'UTF-8');

This works as you want (it replaces invalid characters with whitespace), but it doesn't seem to work if you want to substitute invalid characters with something else, like ?.

See: Replacing invalid UTF-8 characters by question marks, mbstring.substitute_character seems ignored
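For what it's worth, here is a minimal sketch of choosing the substitute at runtime with mb_substitute_character() before converting. The helper name mb_replace_invalid and its default of a space (0x20) are my own assumptions, not part of mbstring, and behaviour has varied between PHP versions (as the linked question suggests), so check it on yours.

```php
<?php
// Hypothetical helper (not part of mbstring): re-encode UTF-8 to UTF-8 so that
// invalid bytes are replaced by the chosen substitute code point.
function mb_replace_invalid($text, $substitute = 0x20)
{
    // Remember the current setting so the change does not leak to other code.
    $previous = mb_substitute_character();

    // Accepts a code point (0x20 is a space) or "none", "long", "entity".
    mb_substitute_character($substitute);
    $clean = mb_convert_encoding($text, 'UTF-8', 'UTF-8');

    // Restore whatever was configured before.
    mb_substitute_character($previous);

    return $clean;
}
```

Calling mb_replace_invalid($text, 0x3F) would ask for ? instead of a space, and "none" would strip the invalid bytes.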

Frosty Z answered Sep 21 '22 10:09


iconv was not working in my case (nor were the other solutions), so I found mine here, in the "Character validation" section:

http://webcollab.sourceforge.net/unicode.html
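If you only need to know whether a string is valid before deciding what to do with it, a quick check is possible with the PCRE u modifier. This is just a sketch and is not taken from the linked page; the function name is_valid_utf8 is made up for illustration.

```php
<?php
// Hypothetical helper: returns true when $text is well-formed UTF-8.
// With the "u" modifier, preg_match() returns false (instead of 1 or 0)
// when the subject contains invalid UTF-8, so an empty pattern is enough.
function is_valid_utf8($text)
{
    return preg_match('//u', $text) === 1;
}
```

mb_check_encoding($text, 'UTF-8') should be an alternative if mbstring is available.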

like image 36
bobef Avatar answered Sep 24 '22 10:09

bobef


If you have come across the cursed 'Invalid Character' error while using PHP's XML or JSON parser, then you may be interested in this.

Unfortunately, PHP's XML and JSON parsers do not ignore non-UTF-8 characters; instead they stop and throw a rather unhelpful error. I found the code below on the net, and it works excellently for me.

// Reject overlong 2-byte sequences, as well as characters above U+10000, and replace them with ?
$some_string = preg_replace('/[\x00-\x08\x10\x0B\x0C\x0E-\x19\x7F]'.
 '|[\x00-\x7F][\x80-\xBF]+'.
 '|([\xC0\xC1]|[\xF0-\xFF])[\x80-\xBF]*'.
 '|[\xC2-\xDF]((?![\x80-\xBF])|[\x80-\xBF]{2,})'.
 '|[\xE0-\xEF](([\x80-\xBF](?![\x80-\xBF]))|(?![\x80-\xBF]{2})|[\x80-\xBF]{3,})/S',
 '?', $some_string );

// Reject overlong 3-byte sequences and UTF-16 surrogates, and replace them with ?
$some_string = preg_replace('/\xE0[\x80-\x9F][\x80-\xBF]'.
 '|\xED[\xA0-\xBF][\x80-\xBF]/S','?', $some_string );
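
Since the question asks for whitespace rather than ?, here is a different approach as a rough sketch: match every well-formed UTF-8 sequence and keep it, and replace any other single byte. The function name replace_invalid_utf8 is my own, and unlike the snippet above it keeps valid 4-byte characters and leaves ASCII control characters alone.

```php
<?php
// Hypothetical helper: keep well-formed UTF-8 sequences (per RFC 3629),
// replace every byte that is not part of one with $replacement.
function replace_invalid_utf8($text, $replacement = ' ')
{
    // No "u" modifier on purpose: the pattern works on raw bytes.
    $pattern = '/
        ( [\x00-\x7F]                          # ASCII
        | [\xC2-\xDF][\x80-\xBF]               # 2 bytes, no overlong forms
        | \xE0[\xA0-\xBF][\x80-\xBF]           # 3 bytes, no overlong forms
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}    # 3 bytes
        | \xED[\x80-\x9F][\x80-\xBF]           # 3 bytes, no UTF-16 surrogates
        | \xF0[\x90-\xBF][\x80-\xBF]{2}        # 4 bytes, no overlong forms
        | [\xF1-\xF3][\x80-\xBF]{3}            # 4 bytes
        | \xF4[\x80-\x8F][\x80-\xBF]{2}        # 4 bytes, up to U+10FFFF
        )
        | .                                    # any other single byte
    /xs';

    return preg_replace_callback(
        $pattern,
        function ($m) use ($replacement) {
            // Group 1 is only set when a valid sequence matched.
            return (isset($m[1]) && $m[1] !== '') ? $m[1] : $replacement;
        },
        $text
    );
}
```

Each invalid byte gets its own replacement, so a broken 3-byte sequence turns into up to three spaces. Pass '?' as the second argument if you prefer the behaviour of the code above.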

George John answered Sep 21 '22 10:09