Getting weird characters like  instead of or ’? Most likely there is a Character set problem. It can occur when a MySQL and PHP are upgraded or when data has been incorrectly stored or the application is sending an incorrect (or missing) character set to the browser.
It is a character encoding issue. Whom ever is sending the mail is using a character set that is not appropriate. View menu (Alt+V) > character encoding and select UTF-8 or unicode should see the correct display.
So what's the problem,
It's a ’
(RIGHT SINGLE QUOTATION MARK
- U+2019) character which is being decoded as CP-1252 instead of UTF-8. If you check the encodings table, then you see that this character is in UTF-8 composed of bytes 0xE2
, 0x80
and 0x99
. If you check the CP-1252 code page layout, then you'll see that each of those bytes stand for the individual characters â
, €
and ™
.
and how can I fix it?
Use UTF-8 instead of CP-1252 to read, write, store, and display the characters.
I have the Content-Type set to UTF-8 in both my
<head>
tag and my HTTP headers:<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
This only instructs the client which encoding to use to interpret and display the characters. This doesn't instruct your own program which encoding to use to read, write, store, and display the characters in. The exact answer depends on the server side platform / database / programming language used. Do note that the one set in HTTP response header has precedence over the HTML meta tag. The HTML meta tag would only be used when the page is opened from local disk file system instead of from HTTP.
In addition, my browser is set to
Unicode (UTF-8)
:
This only forces the client which encoding to use to interpret and display the characters. But the actual problem is that you're already sending ’
(encoded in UTF-8) to the client instead of ’
. The client is correctly displaying ’
using the UTF-8 encoding. If the client was misinstructed to use, for example ISO-8859-1, you would likely have seen ââ¬â¢
instead.
I am using ASP.NET 2.0 with a database.
This is most likely where your problem lies. You need to verify with an independent database tool what the data looks like.
If the ’
character is there, then you aren't connecting to the database correctly. You need to tell the database connector to use UTF-8.
If your database contains ’
, then it's your database that's messed up. Most probably the tables aren't configured to use UTF-8
. Instead, they use the database's default encoding, which varies depending on the configuration. If this is your issue, then usually just altering the table to use UTF-8 is sufficient. If your database doesn't support that, you'll need to recreate the tables. It is good practice to set the encoding of the table when you create it.
You're most likely using SQL Server, but here is some MySQL code (copied from this article):
CREATE DATABASE db_name CHARACTER SET utf8;
CREATE TABLE tbl_name (...) CHARACTER SET utf8;
If your table is however already UTF-8, then you need to take a step back. Who or what put the data there. That's where the problem is. One example would be HTML form submitted values which are incorrectly encoded/decoded.
Here are some more links to learn more about the problem:
Ensure the browser and editor are using UTF-8 encoding instead of ISO-8859-1/Windows-1252.
Or use ’
.
’
(Unicode codepoint U+2019 RIGHT SINGLE QUOTATION MARK
) is encoded in UTF-8 as bytes:
0xE2 0x80 0x99
.
’
(Unicode codepoints U+00E2 U+20AC U+2122
) is encoded in UTF-8 as bytes:
0xC3 0xA2
0xE2 0x82 0xAC
0xE2 0x84 0xA2
.
These are the bytes your browser is actually receiving in order to produce ’
when processed as UTF-8.
That means that your source data is going through two charset conversions before being sent to the browser:
The source ’
character (U+2019
) is first encoded as UTF-8 bytes:
0xE2 0x80 0x99
those individual bytes were then being mis-interpreted and decoded to Unicode codepoints U+00E2 U+20AC U+2122
by one of the Windows-125X charsets (1252, 1254, 1256, and 1258 all map 0xE2 0x80 0x99
to U+00E2 U+20AC U+2122
), and then those codepoints are being encoded as UTF-8 bytes:
0xE2
-> U+00E2
-> 0xC3 0xA2
0x80
-> U+20AC
-> 0xE2 0x82 0xAC
0x99
-> U+2122
-> 0xE2 0x84 0xA2
You need to find where the extra conversion in step 2 is being performed and remove it.
I have some documents where …
was showing as …
and ê
was showing as ê
. This is how it got there (python code):
# Adam edits original file using windows-1252
windows = '\x85\xea'
# that is HORIZONTAL ELLIPSIS, LATIN SMALL LETTER E WITH CIRCUMFLEX
# Beth reads it correctly as windows-1252 and writes it as utf-8
utf8 = windows.decode("windows-1252").encode("utf-8")
print(utf8)
# Charlie reads it *incorrectly* as windows-1252 writes a twingled utf-8 version
twingled = utf8.decode("windows-1252").encode("utf-8")
print(twingled)
# detwingle by reading as utf-8 and writing as windows-1252 (it's really utf-8)
detwingled = twingled.decode("utf-8").encode("windows-1252")
assert utf8==detwingled
To fix the problem, I used python code like this:
with open("dirty.html","rb") as f:
dt = f.read()
ct = dt.decode("utf8").encode("windows-1252")
with open("clean.html","wb") as g:
g.write(ct)
(Because someone had inserted the twingled version into a correct UTF-8 document, I actually had to extract only the twingled part, detwingle it and insert it back in. I used BeautifulSoup for this.)
It is far more likely that you have a Charlie in content creation than that the web server configuration is wrong. You can also force your web browser to twingle the page by selecting windows-1252 encoding for a utf-8 document. Your web browser cannot detwingle the document that Charlie saved.
Note: the same problem can happen with any other single-byte code page (e.g. latin-1) instead of windows-1252.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With