
HTML - £ pound symbol from database displayed as ? even with charset=UTF-8

We have a bunch of database data that someone manually entered. They contain a lot of British Pound (£) symbols. The original user copy/pasted the pound sign from somewhere, not sure where (I'm not sure if it matters or not...).

Anyways, when printing out the data on a PHP page, the pound signs appear as the replacement character. The page has <meta charset="utf-8"/> in it. In the browser, if you change the encoding to ISO-8859-1, then the pound signs show up correctly.

After some digging, I've come to the conclusion that the original data entry person copy/pasted an ISO-8859-1 encoded pound sign into the database. So unless the page is rendered using ISO-8859-1, it won't show up correctly.

Here is the header information from Chrome:

Request URL:http://www.mysite.com/test.php
Request Method:GET
Status Code:200 OK
Request Headers
Accept:text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Charset:ISO-8859-1,utf-8;q=0.7,*;q=0.3
Accept-Encoding:gzip,deflate,sdch
Accept-Language:en-US,en;q=0.8
Cache-Control:max-age=0
Connection:keep-alive
Cookie:X-Mapping-goahf....
Host:www.mysite.com
User-Agent:Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.121 Safari/535.2
Response Headers
Connection:Keep-Alive
Content-Type:text/html; charset=UTF-8
Date:Wed, 07 Dec 2011 22:38:14 GMT
Server:Apache/2.2
Transfer-Encoding:chunked

Also the MySQL table says it uses latin1_swedish_ci which was the default.

So how do I go about fixing this problem? I don't know a lot about how character encoding works and what happens when you copy/paste characters from one place to another.

I tried going to this page:

http://www.fileformat.info/info/unicode/char/a3/browsertest.htm

And copying the pound symbol and pasting it into the database, thinking that would fix it, but it didn't appear to... How do I make the pound symbol that is in the database a UTF-8 pound symbol instead of an ISO-8859-1 one?

Jake Wilson asked Dec 07 '11 23:12



2 Answers

It does not matter where the original pound sign was copied from. It does not even matter in what encoding it is stored in the database. The database works on a character level, meaning that if you ask it to store the £ character, it stores the £ character; how exactly that happens behind the scenes and what encoding it's using to do that is an implementation detail that is irrelevant.

What you are missing is that there's a connection encoding. When you connect to the database, you are talking to it, implicitly or explicitly, using a certain character set. That means that any bytes you send to the database are expected to represent characters in that encoding (so the database knows what characters it's receiving), and any text data you receive from the database will be encoded in that encoding (so you know how to treat the results). The default for that connection encoding is often the Latin-1 charset (a.k.a. ISO-8859-1). So when you receive the £ sign from the database, the database converts it on the fly into Latin-1, regardless of the encoding it is stored in. You are therefore receiving the £ sign encoded in Latin-1 and outputting it as is into your page, while telling the browser to interpret the page as UTF-8. That of course leads to a misinterpreted character.
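To make the mismatch concrete, here is a minimal sketch in Python (used only for illustration; the question itself is about PHP, and the variable names are made up for this example). It assumes the connection hands back the £ sign as a single Latin-1 byte and shows what a UTF-8 decoder, such as a browser told charset=UTF-8, makes of it:

```python
# "£" is the Unicode code point U+00A3. Its bytes depend on the encoding.
pound = "\u00a3"

latin1_bytes = pound.encode("latin-1")  # what a Latin-1 connection would deliver
utf8_bytes = pound.encode("utf-8")      # what a UTF-8 page is expected to contain

print(latin1_bytes.hex())  # a3
print(utf8_bytes.hex())    # c2a3

# A lone 0xA3 byte is not valid UTF-8, so a UTF-8 decoder substitutes
# the replacement character U+FFFD for it.
seen_as_utf8 = latin1_bytes.decode("utf-8", errors="replace")

# Decoding the same byte as Latin-1 recovers the pound sign, which matches
# what happens when the browser encoding is switched to ISO-8859-1.
seen_as_latin1 = latin1_bytes.decode("latin-1")

print(seen_as_utf8 == "\ufffd")   # True
print(seen_as_latin1 == pound)    # True
```

The point is that the same character has different byte representations, so the bytes sent to the browser and the charset declared to the browser have to agree.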

You can change the connection defaults in various ways, either in the MySQL configuration, using certain methods in your client library (which you aren't specifying) or by issuing the query SET NAMES utf8; after connecting to the database.

deceze answered Sep 29 '22 12:09


You can't simply take raw text in one encoding and use a utf8 meta tag to display it.

latin1_swedish_ci is a collation of MySQL's latin1 character set, which is close to ISO-8859-1 (MySQL actually treats latin1 as Windows-1252). So either you convert the data to UTF-8 or you fix the meta tag to declare the encoding you are actually sending.

If you are going to convert it I suggest iconv. You may have to make sure mysql knows the new coding too. Someone else seems to have gone through it at http://drupal.org/node/62258
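As a rough illustration of what such a conversion does at the byte level, here is a small Python sketch (the helper name latin1_to_utf8 and the sample string are hypothetical, not taken from the question). It does the same kind of re-encoding that iconv -f ISO-8859-1 -t UTF-8 performs, and assumes the input really is Latin-1:

```python
def latin1_to_utf8(raw: bytes) -> bytes:
    """Re-encode bytes that are known to be Latin-1 as UTF-8."""
    return raw.decode("latin-1").encode("utf-8")


# Hypothetical sample value containing a Latin-1 pound sign (byte 0xA3).
stored = b"Price: \xa320"

converted = latin1_to_utf8(stored)
print(converted)  # b'Price: \xc2\xa320'

# Caution: running the conversion on data that is already UTF-8 double-encodes
# it, which is what produces the familiar "Â£" garbage.
double_encoded = latin1_to_utf8(converted)
print(double_encoded.decode("utf-8"))  # Price: Â£20
```

Because of the double-encoding risk, it is worth checking which encoding the bytes are really in before converting anything in bulk.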

koan answered Sep 29 '22 12:09