Here's the workflow:
latin1
I realise there's a mismatch between the encoding of the input and output pages and that of the database, and I'm going to fix that.
Shouldn't the character survive the trip to and from the database anyway?
And how does a single character (0x92 if I'm not confused) go through that process and come out the other end as three characters?
Can someone talk me through what's happening to the bytes at each stage of the process?
Step 1:
Word converts ' to ’ (Unicode codepoint U+2019, RIGHT SINGLE QUOTATION MARK).
Step 2:
’ is encoded into UTF-8 as E2 80 99.
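Step 3:
This appears to be where the problem occurs. It looks like the UTF-8 bytes are stored without conversion in the latin-1-encoded MySQL field, so each byte is treated as a separate character: E2 80 99 in latin-1 is â€™ (MySQL's latin1 is effectively Windows-1252, where E2 is â, 80 is € and 99 is ™).
Step 4:
Either here or in the previous step, that misinterpreted latin-1 string is converted to UTF-8. â€™ in UTF-8 is C3 A2 E2 82 AC E2 84 A2.
This will display on a UTF-8-encoded website as â€™, which is how one character comes out the other end as three.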
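If it helps to see the bytes, here is a minimal Python sketch of the four steps above. It only simulates the encode/decode chain; it does not talk to MySQL, and it assumes `cp1252` is the right stand-in for MySQL's latin1. The `hex_bytes` helper and the variable names are made up for illustration.

```python
def hex_bytes(data: bytes) -> str:
    """Format bytes as space-separated uppercase hex, e.g. 'E2 80 99'."""
    return " ".join(f"{b:02X}" for b in data)

# Step 1: Word replaces ' with U+2019 (RIGHT SINGLE QUOTATION MARK).
quote = "\u2019"

# In Windows-1252 this character is the single byte 0x92,
# which is the value mentioned in the question.
cp1252_form = quote.encode("cp1252")

# Step 2: the UTF-8 web form sends it as three bytes.
utf8_form = quote.encode("utf-8")

# Step 3: those three bytes are read as if each were one
# cp1252 character (assumed stand-in for MySQL's latin1).
misread = utf8_form.decode("cp1252")

# Step 4: the misread three-character string is encoded to UTF-8 again.
double_encoded = misread.encode("utf-8")

print(hex_bytes(cp1252_form))     # 92
print(hex_bytes(utf8_form))       # E2 80 99
print(misread)                    # â€™
print(hex_bytes(double_encoded))  # C3 A2 E2 82 AC E2 84 A2
```

Nothing is lost along the way, which is why the damage is reversible in this particular case: encoding the three-character string back to `cp1252` and decoding it as UTF-8 gives ’ again.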