
Change website character encoding from ISO-8859-1 to UTF-8

About two years ago I made the mistake of starting a large website using ISO-8859-1. I am now having issues with some characters, especially when sending data to the server via AJAX. Because of this, I would like to switch to UTF-8.

What issues do you see coming from this? I know I would have to search the site for characters that need to be changed from a ? back to their real characters. But are there any other risks in doing this? Has anyone done this before?

Nic Hubbard, asked Oct 20 '09



2 Answers

Such a change touches (nearly) every part of your system. You need to go through everything, from the database to the PHP to the HTML to the web browser.

Start a test site and subject it to some serious testing (various browsers on various platforms doing various things).

IMO it's important to actually get familiar with UTF-8 and what it means for software. A few quick points:

  • PHP is mostly byte-oriented. Learn the difference between characters and code points and bytes, and between UTF-8 and Unicode.
  • UTF-8 is well-designed. For instance, given two UTF-8 strings, a byte-oriented strstr() will still function correctly.
  • The most common problem is treating a UTF-8 string as ISO-8859-1, or vice versa. Document which encoding each of your functions expects, to make these errors less likely; a naming convention for string variables, indicating their encoding, can also help.
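To make the byte/character distinction concrete, here is a quick shell sketch (it assumes a UTF-8 terminal and the iconv utility; the specific strings are just examples):

```shell
# 'é' is one character but two bytes in UTF-8 (0xC3 0xA9)
printf 'é' | wc -c                                   # byte count: 2

# the same character re-encoded as ISO-8859-1 is a single byte (0xE9)
printf 'é' | iconv -f UTF-8 -t ISO-8859-1 | wc -c    # byte count: 1

# a byte-oriented substring search still works on UTF-8 text, because
# no character's UTF-8 byte sequence can appear inside another's
printf 'café' | grep -c 'é'                          # prints 1
```

This is why byte-oriented functions like PHP's strstr() keep working on UTF-8, while anything that counts characters (strlen(), substr()) needs the mbstring equivalents.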
Artelius, answered Sep 19 '22



The main difficulty is making sure you've checked that all the data paths are UTF-8 clean:

  1. Is your site DB-backed? If so, you'll need to convert all the tables to UTF-8 or some other Unicode encoding, so sorting and text searching work correctly.

  2. Is your site using a programming language for dynamic content (PHP, mod_perl, ASP...)? If so, you'll have to make sure the interpreter you're using fully understands some form of Unicode, work out the conversions if it doesn't use UTF-8 natively (UTF-16 is the next most common), and check that it's configured to send UTF-8 to the web server.

  3. Does your site have some kind of back-end app server? Does it use UTF-8 for its text outputs?

  4. There are at least three different places you can declare the charset for a web document. Be sure you change them all:

    • the HTTP Content-Type header
    • the <meta http-equiv="Content-Type"> tag in your documents' <head>
    • the <?xml?> declaration at the top of the document, if you're serving XHTML as XML
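For reference, the three declarations look like this (a sketch; the header line comes from your server or application code, not from the document itself):

```html
<!-- 1. HTTP response header, sent by the server or application
        (e.g. via PHP's header() function):
        Content-Type: text/html; charset=utf-8                      -->

<!-- 2. XML declaration, first line of an XHTML document served as XML: -->
<?xml version="1.0" encoding="utf-8"?>

<!-- 3. meta tag inside the document's <head>: -->
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
```

If the HTTP header and the meta tag disagree, browsers honor the HTTP header, so that one matters most.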

All this comes from my experience years ago, when I traced some Unicode data through a moderately complex N-tier app and found conversion chains like:

Latin-1 → UTF-8 → Latin-1 → UTF-8

So, even though the data ended up in the browser claiming to be "UTF-8", the app could still only handle the subset of characters that Latin-1 covers.

The biggest reason for those odd conversion chains was immature Unicode support in the tooling at the time, but you can still find yourself dealing with ugliness like this if you're not careful to keep the whole pipeline UTF-8 clean.
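You can reproduce that failure mode directly with iconv: characters in the Latin-1 subset survive the round trip, while anything outside it is lost at the first downconversion. A minimal sketch:

```shell
# Latin-1 subset: survives a UTF-8 -> Latin-1 -> UTF-8 round trip
printf 'café' | iconv -f UTF-8 -t ISO-8859-1 | iconv -f ISO-8859-1 -t UTF-8
# prints: café

# Outside the subset: the euro sign has no ISO-8859-1 code point,
# so the downconversion step fails outright
printf '€' | iconv -f UTF-8 -t ISO-8859-1 2>/dev/null || echo 'conversion failed'
# prints: conversion failed
```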

As for your comments about searching out Latin-1 characters and converting files one by one, I wouldn't do that. I'd build a script around the iconv utility found on every modern Linux system, feeding in every text file in your system, explicitly converting it from Latin-1 to UTF-8. Leave no stone unturned.
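A sketch of such a script, demonstrated here against a scratch directory created with mktemp (for the real site you'd point it at your document root, under version control or with a backup so a bad pass can be rolled back; the file extensions listed are just examples):

```shell
#!/bin/sh
# Demo setup: a scratch directory holding one ISO-8859-1 encoded file
# (octal \351 is byte 0xE9, 'é' in Latin-1).
SITE_ROOT="$(mktemp -d)"
printf 'caf\351\n' > "$SITE_ROOT/index.html"

# Convert every matching text file in place, keeping the original
# until the conversion has succeeded.
find "$SITE_ROOT" -type f \( -name '*.html' -o -name '*.php' \
                          -o -name '*.css' -o -name '*.js' \) |
while IFS= read -r f; do
    tmp="$f.utf8.tmp"
    if iconv -f ISO-8859-1 -t UTF-8 "$f" > "$tmp"; then
        mv "$tmp" "$f"
    else
        echo "skipped (conversion failed): $f" >&2
        rm -f "$tmp"
    fi
done

cat "$SITE_ROOT/index.html"    # now valid UTF-8: café
```

Writing to a temp file and renaming only on success means a failed conversion never clobbers the original.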

Warren Young, answered Sep 20 '22
