I'm having some nasty character encoding problems that I just can't figure out.
Essentially, I'm screen scraping some HTML off of a site using PHP, then running it through PHP's DOMDocument to change out some URLs, etc. When it's done, it outputs HTML with some weird characters. For example, where there should be an end quote, it puts out ”
I have the page's meta tag for charset set to utf-8, but then the ” characters are showing up as â€ on the site. I'm not sure if I just don't understand character encoding, or what.
Any suggestions on the best way to resolve this? Something client side with a meta tag, or some kind of server-side PHP conversion?
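The symptom described above looks like UTF-8 bytes being read in a single-byte encoding. Here is a minimal sketch of that, assuming the wrong encoding is Windows-1252 (the question does not confirm which one is actually involved). It is written in Python purely for illustration, and the variable names are made up for this example.

```python
# The right double quotation mark (U+201D) from the question.
quote = "\u201d"

# UTF-8 encodes it as three bytes.
utf8_bytes = quote.encode("utf-8")
print(utf8_bytes.hex(" "))  # e2 80 9d

# Reading the first two bytes as Windows-1252 yields a-circumflex plus the
# euro sign, i.e. the "â€" seen on the page. The third byte, 0x9D, has no
# assigned character in Windows-1252, which is why nothing visible follows.
misread = utf8_bytes[:2].decode("cp1252")
print(misread == "\u00e2\u20ac")  # True

# Mis-decoding as Latin-1 maps every byte to some character, so the damage
# is reversible as long as the garbled text has not been altered further.
garbled = utf8_bytes.decode("latin-1")
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired == quote)  # True
```

If this is what is happening, the bytes themselves are fine and the fix is to make every step (the scraper, DOMDocument, and the served page) agree that the text is UTF-8.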
As a content author or developer, you should nowadays always choose the UTF-8 character encoding for your content or data. This Unicode encoding is a good choice because you can use a single character encoding to handle any character you are likely to need. This greatly simplifies things.
Most libraries that don't hold a lot of foreign language materials will be perfectly fine with the ISO-8859-1 encoding (also called Latin-1, and often loosely called extended ASCII), but if you do hold a lot of foreign language materials you should choose UTF-8, since it can represent far more characters.
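Which of the two is more efficient depends on the characters. UTF-16 needs fewer bytes for characters in the range U+0800 to U+FFFF (2 bytes versus 3 in UTF-8), which covers most Chinese, Japanese, and Korean text. UTF-8 needs fewer bytes for ASCII characters, U+0000 to U+007F (1 byte versus 2). For everything else the two take the same space: 2 bytes each for U+0080 to U+07FF, and 4 bytes each for characters above U+FFFF.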
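To make the size comparison concrete, here is a small Python sketch that measures a few sample characters. The sample set and the helper name `encoded_sizes` are chosen for illustration only.

```python
# Sample characters from different Unicode ranges, written as escapes so the
# code does not depend on how this page itself is encoded.
samples = {
    "A": "ASCII letter (U+0041)",
    "\u00e9": "Latin small e with acute (U+00E9)",
    "\u4e2d": "CJK ideograph (U+4E2D)",
    "\U0001F600": "emoji outside the BMP (U+1F600)",
}

def encoded_sizes(ch):
    """Return (utf8_bytes, utf16_bytes) for a single character."""
    # "utf-16-le" is used so the 2-byte byte-order mark that plain "utf-16"
    # prepends is not counted.
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))

for ch, label in samples.items():
    u8, u16 = encoded_sizes(ch)
    print(f"{label}: UTF-8 = {u8} bytes, UTF-16 = {u16} bytes")
```

For these four samples the sizes come out as 1 vs 2, 2 vs 2, 3 vs 2, and 4 vs 4 bytes, so neither encoding wins across the board.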
UTF-8 is a character encoding system. It encodes ASCII characters as the same single bytes that ASCII uses, while still allowing for international characters, such as Chinese characters, through multi-byte sequences. As of the mid 2020s, UTF-8 is by far the most widely used encoding on the web.
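The relationship between UTF-8 and ASCII can be checked directly. This is a short Python sketch with an arbitrary sample string. It shows that plain ASCII text has identical bytes in both encodings, and that a Chinese character becomes a multi-byte sequence that cannot be confused with ASCII.

```python
text = "price: 5"

# Pure ASCII text produces exactly the same bytes under ASCII and UTF-8.
ascii_compatible = text.encode("ascii") == text.encode("utf-8")
print(ascii_compatible)  # True

# A Chinese character (U+4E2D) is encoded as three bytes.
zhong = "\u4e2d".encode("utf-8")
print(zhong.hex(" "))  # e4 b8 ad

# Every byte of a multi-byte sequence is 0x80 or higher, so none of them can
# be mistaken for an ASCII character such as a quote or an angle bracket.
all_high = all(b >= 0x80 for b in zhong)
print(all_high)  # True
```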
Sometimes setting the charset in HTML or the response header isn't enough. If everything isn't set up for UTF-8 on your server, your text may get incorrectly converted somewhere along the way. You may need to enable UTF-8 encoding for both Apache and PHP right in their config files. (If you're not using Apache, skip the Apache step.)
Edit either your charset.conf (ideal), or httpd.conf file, by adding this line to the end:
AddDefaultCharset utf-8
(If you don't have access to Apache's config files, you can create a ".htaccess" file in your site's document root containing that same line.)
Edit your php.ini file, searching for "default_charset", and change it to:
default_charset = "utf-8"
Finally, restart Apache so the changes take effect. Depending on your server type, this command may do the trick from the command line:
sudo service apache2 restart
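After the restart, it is worth confirming that the server really announces UTF-8 in its Content-Type response header. Here is a minimal Python sketch of pulling the charset out of a header value. The helper name `charset_from_content_type` is hypothetical, and the header strings are made-up samples rather than output from any real server.

```python
from email.message import Message

def charset_from_content_type(header_value):
    """Return the lowercased charset parameter of a Content-Type value, or None."""
    # The standard library's Message class already knows how to parse
    # parameters such as "charset=..." out of a Content-Type header.
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()

# Sample header values, as a correctly configured and a bare server might send.
print(charset_from_content_type("text/html; charset=UTF-8"))  # utf-8
print(charset_from_content_type("text/html"))                 # None
```

When fetching a live page with `urllib.request.urlopen`, the response's `headers` object offers the same `get_content_charset()` method, so the check can be run against your own site. A result of `None` would suggest the default charset setting has not taken effect.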