PHP: How to encode U+FFFD in order to do a replace?

I'm trying to display a data feed on a page. We're experiencing encoding issues with a weird character. For some reason, in the feed there's the U+FFFD character. And htmlentities() will not escape the character, so I need to replace it manually. (I'm using PHP 5.3)

I've tried the following:

$string = str_replace( "\xFFFD",  "_", $string );
$string = str_replace( "\XFFFD",  "_", $string );
$string = str_replace( "\uFFFD",  "_", $string );
$string = str_replace("\x{FFFD}", "_", $string );
$string = str_replace("\X{FFFD}", "_", $string );
$string = str_replace("\P{FFFD}", "_", $string );
$string = str_replace("\p{FFFD}", "_", $string );

None of the above work.

After reading this page - http://php.net/manual/en/regexp.reference.unicode.php - I'm not sure what I'm doing wrong. Do I need to compile UTF-8 support into PCRE?

redolent asked Dec 05 '12 15:12


1 Answer

You should try to fix the original problem. U+FFFD (the Unicode replacement character) is, in most cases, not meant to be a real text character. It is a sign that something was decoded as a UTF encoding even though it was not actually encoded that way. Decoders emit it as an alternative to silently discarding invalid bytes or halting the decoding process entirely. Either way, if you see it, there was an error.

There is no way to know what the original character was. With your solution in particular, once you replace the character with _, you can no longer even tell that the source was decoded incorrectly. You should go back to the source and decode it properly.
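As a rough sketch of what "decode it properly" could look like, the helper below converts the raw feed bytes to UTF-8 before anything else touches them. The function name `feed_to_utf8` and the default source encoding `ISO-8859-1` are assumptions made for illustration; the real encoding has to come from the feed's provider or its HTTP headers. It also assumes you still have access to the raw bytes: if U+FFFD is already present when the data reaches you, the damage happened upstream and this will not undo it.

```php
<?php
// Hypothetical helper: convert raw feed bytes to UTF-8.
// $assumedSource is a guess about the feed's real encoding, not a known fact.
function feed_to_utf8($raw, $assumedSource = 'ISO-8859-1')
{
    // Bytes that are already well-formed UTF-8 are returned untouched.
    if (mb_check_encoding($raw, 'UTF-8')) {
        return $raw;
    }
    // Otherwise reinterpret the bytes using the assumed source encoding.
    return mb_convert_encoding($raw, 'UTF-8', $assumedSource);
}

// "caf\xE9" is "café" in ISO-8859-1; a lone 0xE9 byte is not valid UTF-8.
$fixed = feed_to_utf8("caf\xE9");
```

The check comes first so that text which is already valid UTF-8 is not converted a second time.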

Note: It's possible for a source text to use U+FFFD as a literal, normal character, for instance when talking about it, and there is no error then. I am excluding this possibility in my answer.
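If you do need to find the character, for example to log that a feed arrived damaged, here is a minimal sketch. It assumes the string is UTF-8, where U+FFFD is the three bytes EF BF BD. The attempts in the question fail because str_replace() matches plain bytes rather than regex syntax, "\xFFFD" in a PHP string is the byte 0xFF followed by the letters "FD", and "\uFFFD" is not an escape sequence in PHP 5.3. The function name `count_replacement_chars` is hypothetical.

```php
<?php
// Hypothetical helper: count U+FFFD occurrences in a UTF-8 string.
function count_replacement_chars($utf8)
{
    // U+FFFD encoded as UTF-8 is the byte sequence EF BF BD.
    return substr_count($utf8, "\xEF\xBF\xBD");
}

$damaged = "a\xEF\xBF\xBDb\xEF\xBF\xBD";

// Number of replacement characters found in the sample string.
$count = count_replacement_chars($damaged);

// The same byte sequence also works with str_replace(), although replacing
// the character hides the decoding error instead of fixing it.
$masked = str_replace("\xEF\xBF\xBD", "_", $damaged);
```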

Esailija answered Sep 20 '22 07:09