
Character encoding fail, why does \xBD display improperly in PHP + HTML

I'm just trying to understand character encoding a bit better, so I'm doing a few tests.

I have a PHP file that is saved as UTF-8 and looks like this:

<?php
declare(encoding='UTF-8');

header( 'Content-type: text/html; charset=utf-8' );
?><!DOCTYPE html>

<html>

<head>
    <meta charset="UTF-8" />
    <title>Test</title>
</head>

<body>
    <?php echo "\xBD"; # Does not work ?>
    <?php echo htmlentities( "\xBD" ) ; # Works ?>
</body>

</html>

The page itself shows this:

[Screenshot not preserved: the first echo renders as a diamond question mark, the second renders correctly.]

The gist of the problem is that my web application has a bunch of character encoding problems: people copy and paste from Outlook or Word, and the characters get turned into diamond question marks. (Do those have a real name?)

I'm trying to learn how to make sure all my input is transformed into UTF-8 when the page loads (Basically $_GET, $_POST, and $_REQUEST), and all output is done using proper UTF-8 handling methods.


My question is: Why is my page showing the question mark for the first echo, and does anyone have any other information about making a UTF-8 safe web app in PHP?

Brandon Wamboldt asked Apr 11 '26 06:04


2 Answers

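On its own, the byte 0xBD is not valid UTF-8: in UTF-8 it can only appear as a continuation byte inside a multi-byte sequence. If you want to encode "½" in UTF-8 then you need to use the two bytes 0xC2 0xBD instead.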

>>> print(b'\xc2\xbd'.decode('utf-8'))
½

If you want to use text from another charset (Latin-1 in this case, where the single byte 0xBD means "½") then you need to transcode it to UTF-8 first, for example with iconv() or mb_convert_encoding().
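
As a rough sketch of that transcoding step (written in Python to match the snippet above, and assuming the stray byte really did come from Latin-1 text), the byte is first interpreted in its original charset and then re-encoded as UTF-8. The variable names are illustrative only.

```python
# Assumption: the stray byte came from Latin-1 (ISO-8859-1) text.
raw = b"\xbd"

# A lone 0xBD is a continuation byte, so strict UTF-8 decoding rejects it.
try:
    raw.decode("utf-8")
    is_valid_utf8 = True
except UnicodeDecodeError:
    is_valid_utf8 = False

# Transcode: interpret the byte as Latin-1, then re-encode as UTF-8.
text = raw.decode("latin-1")
utf8_bytes = text.encode("utf-8")

print(is_valid_utf8)           # False
print(f"U+{ord(text):04X}")    # U+00BD (VULGAR FRACTION ONE HALF)
print(utf8_bytes)              # b'\xc2\xbd'
```

In PHP the same conversion would go through iconv() or mb_convert_encoding(), with the source charset passed explicitly.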

Also, the diamond question mark does have a real name:

$ charinfo �
U+FFFD REPLACEMENT CHARACTER
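
For illustration, here is a small Python sketch of where that character comes from. It assumes a decoder running in a lenient "replace" mode, which is roughly what a browser does when it meets bytes it cannot decode.

```python
import unicodedata

# Decode the invalid byte leniently: instead of raising an error,
# the decoder substitutes a placeholder character.
shown = b"\xbd".decode("utf-8", errors="replace")

print(f"U+{ord(shown):04X}", unicodedata.name(shown))
# U+FFFD REPLACEMENT CHARACTER
```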

Ignacio Vazquez-Abrams answered Apr 13 '26 18:04


\xBD on its own is not valid UTF-8; what you want is \xC2\xBD. The diamond question mark is what applications substitute for byte sequences they cannot decode, so if you see it in your UTF-8 text, the text is either not UTF-8 or corrupted.
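
One way to act on this when accepting input is to validate first and transcode only on failure. The sketch below is in Python and is not a definitive implementation: `to_utf8` is a hypothetical helper, and the Windows-1252 fallback is an assumption (a plausible guess for text pasted from Word or Outlook), not real charset detection.

```python
def to_utf8(raw: bytes, fallback: str = "cp1252") -> bytes:
    """Return raw as UTF-8 bytes, transcoding from `fallback` if needed.

    Hypothetical helper: the fallback charset is a guess, not detection.
    """
    try:
        raw.decode("utf-8")  # already valid UTF-8: keep the bytes as-is
        return raw
    except UnicodeDecodeError:
        # Not UTF-8: reinterpret the bytes using the assumed legacy charset.
        # Bytes undefined in that charset become U+FFFD rather than an error.
        return raw.decode(fallback, errors="replace").encode("utf-8")


print(to_utf8(b"\xc2\xbd"))  # b'\xc2\xbd' (valid input is left untouched)
print(to_utf8(b"\xbd"))      # b'\xc2\xbd' (legacy byte is transcoded)
```

The guess can be wrong: bytes that happen to form valid UTF-8 are passed through even if they were meant as something else, so declaring the charset on forms and pages remains the more reliable fix.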

Musa answered Apr 13 '26 18:04