Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

smart quotes not converting properly into UTF8

I have a PHP script that imports and parses XML files and saves the data into the database:

  • Database collation: utf8_general_ci, charset: utf8
  • Page's charset : utf-8
  • XML files: ANSI, contains smart quotes (from MS Word)

So during import I do a utf8_encode() on the text from the XML files prior to saving into the database and subsequently displaying on the page.

But when successfully imported, and saved into DB,

  • Database: smart quotes are saved as ? character (viewed from CMD)
  • Page: smart quotes are displayed as boxes

Any ideas as to why the smart quotes are not being converted correctly, even when using utf8_encode()?

EDIT:

@Tomalak: The XML files are actually .txt, no XML declaration (<?xml ... ?>), and no root element. My script actually adds a root element just so the parser works:

utf8_encode('<article>' . file_get_contents($xmlfile) . '</article>');

Seems like I need to add an XML declaration..? If so, how should it look like?

like image 627
Obay Avatar asked Jan 17 '23 10:01

Obay


2 Answers

If your XML string (i.e. file contents) is not encoded as UTF-8, you need an XML declaration that denotes the file encoding. If an XML declaration is missing, the parser will assume UTF-8.

As long as you do not use "special" characters (i.e. anything outside of the ASCII range), it will work without a declaration even if your file is not really UTF-8-encoded. This is because UTF-8 is byte-compatible to ASCII. But as soon as characters are used that are on one of the code pages — like the "smart quotes" — it will break because these are represented by different bytes in UTF-8.

In your case there are text files in a legacy encoding that you wrap with a root element to turn them into well-formed XML. Therefore you need to add the XML declaration yourself:

'<?xml encoding="Windows-1252"?><article>'.file_get_contents($xmlfile).'</article>'

This way you instruct the DOMDocument how to interpret the bytes in your string. I assumed Windows-1252 for you because you said ANSI and mentioned the curly quotes.

In fact, 95% of the time this is what people really mean, even on Linux and even if they say ISO-8859-1 (or latin-1), which is almost, but not exactly the same thing.

To be extra sure you can open your text files in a hex editor, spot a few special characters and compare their byte values with the suspected encoding. For Windows-1252. For the curly quotes the expected byte values would be:

  • 147 (0x93)
  • 148 (0x94)

Once the meaning of the individual bytes in your string is declared, DOMDocument can make sense of them and does the right thing.

When it comes to in the DB, I strongly suspect there is some automagic encoding conversion going on. I admit that I don't know enough about PHP/mySQL/Unicode integration to say for sure.

like image 189
Tomalak Avatar answered Jan 20 '23 14:01

Tomalak


They are being converted correctly, because utf8_encode does exactly what it's supposed to be doing and removes them from the input stream.

utf8_encode converts text in ISO-8859-1 into UTF-8 encoding. ISO-8859-1 does not contain the smartquote characters. Your webpage is probably being treated as MSWIN1252 encoding in the browser, which does contain smartquote characters, and you're forcing PHP's input into ISO-8859-1 by calling utf8_encode on it. This means that the only characters that will make it into your database are these ones: http://en.wikipedia.org/wiki/ISO/IEC_8859-1#Codepage_layout

To solve this, you have to make sure that the webpage itself is in UTF-8 encoding, and remove all the utf8_encode/utf8_decode statements from your code. This page explains how to properly work with UTF-8: http://malevolent.com/weblog/archive/2007/03/12/unicode-utf8-php-mysql/

like image 25
Joeri Sebrechts Avatar answered Jan 20 '23 16:01

Joeri Sebrechts