I'd like to understand how a Windows smart quote turns into "â€™"

Question

Here's the workflow:

user types in Word; Word changes a single apostrophe to a "smart quote"
user pastes the text from Word into a form on a web page; the page the form is in is encoded in UTF-8
the data gets saved into a MySQL database with the encoding latin1
when retrieved from the database by a PHP app (which assumes the database encoding is UTF-8) and displayed in a UTF-8 web page, the quote displays as â€™

I realise there's a mismatch between the encoding of the input and output pages and the database. That I'm going to fix.

Shouldn't the character survive the trip to and from the database anyway?

And how does a single character (0x92 if I'm not confused) go through that process and come out the other end as three characters?

Can someone talk me through what's happening to the bytes at each stage of the process?

Tim Pietzcker · Accepted Answer

Step 1:

Word converts ' to ’ (Unicode codepoint U+2019, RIGHT SINGLE QUOTATION MARK).

Step 2:

’ is encoded into UTF-8 as E2 80 99
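As a quick check of this step, here is a minimal Python sketch (the variable names are only illustrative). It also shows where the 0x92 from the question comes from: that is the same character in Windows-1252, not in UTF-8.

```python
# U+2019 RIGHT SINGLE QUOTATION MARK, the "smart" apostrophe Word inserts
quote = "\u2019"

# The UTF-8 form submitted by the web page: three bytes
utf8_hex = quote.encode("utf-8").hex(" ").upper()
print(utf8_hex)  # E2 80 99

# The same character in Windows-1252 is the single byte 0x92
cp1252_bytes = quote.encode("cp1252")
print(cp1252_bytes)  # b'\x92'
```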

Step 3:

This appears to be where the problem occurs. It looks like the UTF-8 string is stored without conversion in the latin-1-encoded MySQL field:

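E2 80 99 in latin-1 is â€™: E2 is â, 80 is € and 99 is ™. (Strictly, 80 and 99 are control characters in ISO-8859-1; MySQL's latin1 is really Windows-1252, which is where € and ™ come from.)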
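The misreading can be reproduced in Python. This is a sketch that assumes the bytes are interpreted as Windows-1252 (Python's `cp1252` codec), which is what MySQL's latin1 corresponds to; Python's own `latin-1` codec is strict ISO-8859-1 and gives invisible control characters for 80 and 99 instead.

```python
# The three UTF-8 bytes of U+2019, as stored in the database
utf8_bytes = bytes([0xE2, 0x80, 0x99])

# Read each byte as one Windows-1252 character
misread = utf8_bytes.decode("cp1252")
print(misread)  # â€™

# Strict ISO-8859-1 maps 0x80 and 0x99 to C1 control characters instead
strict_latin1 = utf8_bytes.decode("latin-1")
print([hex(ord(ch)) for ch in strict_latin1])  # ['0xe2', '0x80', '0x99']
```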

Step 4:

Either here or in the previous step, that string, wrongly treated as latin-1, is converted to UTF-8.

â€™ in UTF-8 is C3 A2 E2 82 AC E2 84 A2.

This will display on a UTF-8-encoded website as â€™.
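The whole trip can be replayed in Python. This is only a sketch of the byte transformations under the assumptions above (UTF-8 bytes misread as Windows-1252, then re-encoded as UTF-8); it does not touch MySQL or PHP, and the variable names are illustrative. The last line reverses the damage for data that has already been stored this way.

```python
original = "\u2019"                         # Step 1: the smart quote
stored = original.encode("utf-8")           # Step 2: E2 80 99
mojibake = stored.decode("cp1252")          # Step 3: misread as three characters
double_encoded = mojibake.encode("utf-8")   # Step 4: each character re-encoded

print(double_encoded.hex(" ").upper())  # C3 A2 E2 82 AC E2 84 A2
print(double_encoded.decode("utf-8"))   # â€™

# Undo the extra round trip: back to the cp1252 bytes, then read them as UTF-8
repaired = double_encoded.decode("utf-8").encode("cp1252").decode("utf-8")
print(repaired == original)  # True
```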
