
I'd like to understand how a Windows smart quote turns into "’"

Here's the workflow:

  1. user types in Word; Word changes a single apostrophe to a "smart quote"
  2. user pastes the text from Word into a form on a web page; the page the form is in is encoded in UTF-8
  3. the data gets saved into a MySQL database with the encoding latin1
  4. when retrieved from the database by a PHP app (which assumes the database encoding is UTF-8) and displayed in a UTF-8 web page, the quote displays as ’

I realise there's a mismatch between the encoding of the input and output pages and the database. That I'm going to fix.

Shouldn't the character survive the trip to and from the database anyway?

And how does a single character (0x92 if I'm not confused) go through that process and come out the other end as three characters?

Can someone talk me through what's happening to the bytes at each stage of the process?

AmbroseChapel asked Jan 15 '23 12:01


1 Answer

Step 1:

Word converts ' to ’ (Unicode code point U+2019, RIGHT SINGLE QUOTATION MARK).

Step 2:

’ is encoded into UTF-8 as E2 80 99.
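As a quick check of the bytes involved, here is a minimal Python sketch (Python is only used for illustration; the names `quote` and `hex_bytes` are made up for this example). It also shows where the 0x92 from the question comes from: that is the same character in Windows-1252, not in UTF-8.

```python
quote = "\u2019"  # RIGHT SINGLE QUOTATION MARK, the "smart" apostrophe

utf8_bytes = quote.encode("utf-8")     # what a UTF-8 form submits
cp1252_bytes = quote.encode("cp1252")  # the single-byte Windows-1252 form


def hex_bytes(data: bytes) -> str:
    """Format bytes as space-separated uppercase hex pairs."""
    return " ".join(f"{b:02X}" for b in data)


print(hex_bytes(utf8_bytes))    # E2 80 99
print(hex_bytes(cp1252_bytes))  # 92
```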

Step 3:

This appears to be where the problem occurs. It looks like the UTF-8 string is stored without conversion in the latin-1-encoded MySQL field:

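E2 80 99 in latin-1 is ’ (E2 is â, 80 is €, 99 is ™). Strictly speaking this is the Windows-1252 reading, which is what MySQL's latin1 actually is; in pure ISO-8859-1, 80 and 99 are control characters.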
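The misreading can be reproduced byte by byte. This is only a sketch of the interpretation step, assuming the latin1 column behaves like Windows-1252 (Python's `cp1252` codec); it does not involve MySQL itself, and `stored` and `misread` are illustrative names.

```python
stored = b"\xe2\x80\x99"  # UTF-8 bytes of U+2019, saved without conversion

# Read each byte on its own as a Windows-1252 character.
for byte in stored:
    print(f"{byte:02X} -> {bytes([byte]).decode('cp1252')}")
# E2 -> â
# 80 -> €
# 99 -> ™

misread = stored.decode("cp1252")
print(misread)       # ’
print(len(misread))  # 3
```

This is how one character becomes three: each of the three UTF-8 bytes is treated as a character in its own right.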

Step 4:

Either here or in the previous step, that misinterpreted latin-1 string is converted to UTF-8.

’ in UTF-8 is C3 A2 E2 82 AC E2 84 A2.

This will display on a UTF-8-encoded website as ’.
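The whole trip can be simulated end to end. The sketch below assumes the same Windows-1252 interpretation of latin1 as above and uses made-up variable names; it also checks that reversing the two steps gives back the original character, which suggests the data is damaged in a recoverable way.

```python
original = "\u2019"

# Steps 2 and 3: UTF-8 bytes are taken to be Windows-1252 characters.
misread = original.encode("utf-8").decode("cp1252")

# Step 4: those three characters are converted to UTF-8.
double_encoded = misread.encode("utf-8")
print(" ".join(f"{b:02X}" for b in double_encoded))
# C3 A2 E2 82 AC E2 84 A2

# A UTF-8 page decodes those eight bytes back into three characters.
shown = double_encoded.decode("utf-8")
print(shown)  # ’

# Undoing the misinterpretation recovers the original quote.
repaired = shown.encode("cp1252").decode("utf-8")
print(repaired == original)  # True
```

The reversal works here because all three bytes have a Windows-1252 mapping. Python's `cp1252` codec leaves a few bytes (such as 0x81 and 0x9D) undefined, so other characters may not round-trip this way.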

Tim Pietzcker answered Jan 18 '23 02:01