
PHP Curly Quote Character Encoding Issue

I know there is an age-old issue with character encoding between different character sets, but I'm stuck on one related to Windows' "curly quotes".

We have a client that likes to copy-and-paste data into a text field and then post it out onto our app. That data will often have curly quotes in it. I used to use the following function to transform them into their normal counterparts:

function convert_smart_quotes($string)
{
    // UTF-8 byte sequences for curly single/double quotes, en dash, em dash and ellipsis
    $badwordchars = array("\xe2\x80\x98", "\xe2\x80\x99", "\xe2\x80\x9c", "\xe2\x80\x9d", "\xe2\x80\x93", "\xe2\x80\x94", "\xe2\x80\xa6");

    // Plain ASCII replacements, in the same order
    $fixedwordchars = array("'", "'", '"', '"', '-', '--', '...');

    return str_replace($badwordchars, $fixedwordchars, $string);
}

This worked great for a few months. Then, after some changes (we switched servers, made updates to the system, upgraded PHP, etc.), we learned it doesn't work anymore. So I took a look and learned that the "curly quotes" are all changing into different characters. In this case, they're turning into the following:

“ = ¡È

” = ¡É

‘ = ¡Æ

’ = ¡Ç

These characters then show up as the cursed "black diamond-question mark symbols" when saved in the database. The MySQL database uses latin1_swedish_ci, as does the app the messages are received on. So, although I know utf-8 is better, it has to remain in latin1_swedish_ci, or ISO-8859-1, or else we'll have to rebuild everything... and that's out of the question.

My webpage, and form, are both posting in utf-8. If I change it to be in ISO-8859-1, the quotes become question marks instead.

I have tried searching the string for occurrences of "¡È" or "¡É" and replacing them with normal quotes, but I couldn't get that to work. I did it by adding the following to my above function:

$string = str_replace("\xa1\xc8", '"', $string);
$string = str_replace("\xa1\xc9", '"', $string);
$string = str_replace("\xa1\xc6", "'", $string);
$string = str_replace("\xa1\xc7", "'", $string);

I've been stuck on this for a couple of hours now and haven't been able to find any real help online. As you can imagine, googling "¡É" doesn't bring a very specific response.

Any guidance is appreciated!

Kenton de Jong asked Dec 19 '22 18:12

2 Answers

Your problem is that you are accepting UTF-8 input from your user and then inserting it into your database as if it were Latin1 (ISO-8859-1). (Note that latin1_swedish_ci is not an encoding but a collation (for Latin1). See this SO question on the difference. For the purpose of solving your character encoding question, the collation is not important.)
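To make the mismatch concrete, here is a minimal sketch of what a UTF-8 curly quote looks like at the byte level. The helper name looks_like_utf8 is made up for illustration; the check relies on preg_match() refusing invalid UTF-8 when the /u modifier is set.

```php
<?php
// A left curly double quote (U+201C) as the UTF-8 bytes a UTF-8 form submits.
$curly = "\xe2\x80\x9c";

// UTF-8 needs three bytes for it; Latin1 is one byte per character, so
// anything reading these bytes as Latin1 sees three unrelated characters.
echo strlen($curly), " bytes: ", bin2hex($curly), "\n"; // prints "3 bytes: e2809c"

// Hypothetical helper: returns true when the string is valid UTF-8.
// preg_match() with /u returns false on malformed UTF-8 input.
function looks_like_utf8(string $s): bool
{
    return preg_match('//u', $s) === 1;
}

var_dump(looks_like_utf8($curly)); // bool(true)
var_dump(looks_like_utf8("\x93")); // bool(false): a lone 0x93 byte is not UTF-8
```

A check like this can help confirm which encoding actually arrives at the server before deciding where to convert.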

Rather than manually identifying important UTF-8 sequences and replacing them, you should use a robust method for converting your UTF-8 string to Latin1 such as iconv.

Note that this is a lossy conversion: some characters that UTF-8 can encode, such as curly quotes, don't exist in Latin1. You can choose to ignore those characters (replacing them with the empty string, or ?, or something else), or you can choose to transliterate them (replacing them with close equivalents, like " for a curly quote). But what do you do if someone puts a character with no close equivalent in your form?

iconv will attempt to transliterate where it can:

// convert from utf8 to latin1, approximating out of range characters
// by the closest latin1 alternative where possible (//TRANSLIT)
$latinString = iconv("UTF-8", "ISO-8859-1//TRANSLIT", $utf8String);

(You can also configure it to ignore all out of range characters — see iconv's documentation for more info.)
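As a rough sketch of both modes side by side, the wrapper below (utf8_to_latin1 is a hypothetical name, not part of iconv) switches between the //TRANSLIT and //IGNORE suffixes. What the suffixes produce depends on the iconv implementation PHP was built against, and some builds reject them outright, so the sketch treats a false return as a normal outcome and makes no claim about the exact output.

```php
<?php
// Hypothetical wrapper: convert UTF-8 to Latin1 with iconv.
// $mode is 'TRANSLIT' (approximate out-of-range characters)
// or 'IGNORE' (drop them).
function utf8_to_latin1(string $utf8, string $mode = 'TRANSLIT')
{
    // iconv() returns false on failure. Warnings and notices are
    // suppressed here so the caller can decide what a failure means.
    return @iconv('UTF-8', 'ISO-8859-1//' . $mode, $utf8);
}

// Curly double quotes around a word, followed by "café", all in UTF-8.
$input = "\xe2\x80\x9cquoted\xe2\x80\x9d caf\xc3\xa9";

foreach (['TRANSLIT', 'IGNORE'] as $mode) {
    $result = utf8_to_latin1($input, $mode);
    // Show the bytes rather than the text so the terminal encoding
    // does not hide what was produced.
    echo $mode, ': ', $result === false ? 'conversion failed' : bin2hex($result), "\n";
}
```

Checking for false matters in practice: a failed conversion that is written to the database unchecked would store an empty value instead of the user's text.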

If you don't want to mess around with adding a new library, PHP also comes with the utf8_decode function (deprecated as of PHP 8.2):

$latinString = utf8_decode($utf8String);

However, PHP was not really designed with multiple character encodings in mind, so I prefer to stay away from the (sometimes buggy) standard library functions that deal with encoding.
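For readers who do stick with the bundled functions, here is a minimal sketch that keeps the transliteration choices explicit instead of leaving them to the library. It reuses the replacement table from the question and then calls mb_convert_encoding() from the mbstring extension (an assumption: mbstring must be enabled) rather than the function shown above. The helper name smart_utf8_to_latin1 is made up for illustration.

```php
<?php
// Hypothetical helper: map common "smart" punctuation to ASCII first,
// then convert whatever is left from UTF-8 to Latin1.
function smart_utf8_to_latin1(string $utf8): string
{
    // Same substitutions as the question's convert_smart_quotes().
    $map = [
        "\xe2\x80\x98" => "'",   // left single quote
        "\xe2\x80\x99" => "'",   // right single quote
        "\xe2\x80\x9c" => '"',   // left double quote
        "\xe2\x80\x9d" => '"',   // right double quote
        "\xe2\x80\x93" => '-',   // en dash
        "\xe2\x80\x94" => '--',  // em dash
        "\xe2\x80\xa6" => '...', // ellipsis
    ];
    $ascii = strtr($utf8, $map);

    // Characters that still have no Latin1 equivalent are replaced by
    // mbstring's substitute character, which is "?" unless reconfigured.
    return mb_convert_encoding($ascii, 'ISO-8859-1', 'UTF-8');
}

$latin = smart_utf8_to_latin1("\xe2\x80\x9cHi\xe2\x80\x9d caf\xc3\xa9\xe2\x80\xa6");
echo bin2hex($latin), "\n"; // bytes of: "Hi" caf<E9>...
```

The table decides what happens to the characters you care about, and the library call only handles the rest, so a server or PHP upgrade is less likely to change the result silently.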

You should also consider reading The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).

sjy answered Dec 24 '22 02:12


You can use either of the lines below to solve this problem.

$str = mb_convert_encoding($str, 'HTML-ENTITIES', 'UTF-8');


$str = mb_convert_encoding($str, 'HTML-ENTITIES', 'auto');

More information can be found on the PHP documentation website.
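A related sketch, under the assumption that storing HTML entities in the Latin1 column is acceptable: it uses mb_encode_numericentity() instead of the 'HTML-ENTITIES' pseudo-encoding, because PHP 8.2 deprecated 'HTML-ENTITIES' for mbstring functions. The helper name utf8_to_numeric_entities is made up for illustration.

```php
<?php
// Hypothetical helper: turn every non-ASCII character into a decimal
// numeric entity, leaving a pure-ASCII string that is safe in Latin1.
function utf8_to_numeric_entities(string $utf8): string
{
    // convmap entries: first code point, last code point, offset, mask.
    $convmap = [0x80, 0x10FFFF, 0, 0x1FFFFF];
    return mb_encode_numericentity($utf8, $convmap, 'UTF-8');
}

$original = "\xe2\x80\x9cHi\xe2\x80\x9d"; // curly-quoted Hi, in UTF-8
$stored = utf8_to_numeric_entities($original);
echo $stored, "\n"; // prints &#8220;Hi&#8221;

// Decoding brings the original UTF-8 text back when it is displayed.
$restored = html_entity_decode($stored, ENT_QUOTES | ENT_HTML5, 'UTF-8');
var_dump($restored === $original); // bool(true)
```

Note that this does not escape ampersands already present in the input, so text that literally contains something like &#8220; cannot be told apart from an encoded quote after storage.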

Rakesh Chandel answered Dec 24 '22 02:12