
Are you fluent in Unicode yet?

Almost 5 years ago Joel Spolsky wrote this article, "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)".

Like many, I read it carefully, realizing it was high time I got to grips with this "replacement for ASCII". Unfortunately, 5 years later I feel I have slipped back into a few bad habits in this area. Have you?

I don't write many specifically international applications; however, I have helped build many internet-facing ASP.NET websites, so I guess that's not an excuse.

So for my benefit (and, I believe, that of many others), can I get some input from people on the following:

  • How to "get over" ASCII once and for all
  • Fundamental guidance when working with Unicode.
  • Recommended (recent) books and websites on Unicode (for developers).
  • Current state of Unicode (5 years after Joel's article)
  • Future directions.

I must admit I have a .NET background and so would also be happy for information on Unicode in the .NET framework. Of course this shouldn't stop anyone with a differing background from commenting though.

Update: See this related question also asked on StackOverflow previously.

Ash asked Sep 12 '08 14:09


2 Answers

Since reading Joel's article and some other i18n articles, I have always kept a close eye on my character encoding, and it actually works if you do it consistently. If you work at a company where using UTF-8 is the standard and everybody knows and follows it, it will work.

Here are some interesting articles (besides Joel's article) on the subject:

  • http://www.tbray.org/ongoing/When/200x/2003/04/06/Unicode
  • http://www.tbray.org/ongoing/When/200x/2003/04/26/UTF

A quote from the first article, "Tips for using Unicode":

  • Embrace Unicode, don't fight it; it's probably the right thing to do, and if it weren't you'd probably have to anyhow.
  • Inside your software, store text as UTF-8 or UTF-16; that is to say, pick one of the two and stick with it.
  • Interchange data with the outside world using XML whenever possible; this makes a whole bunch of potential problems go away.
  • Try to make your application browser-based rather than write your own client; the browsers are getting really quite good at dealing with the texts of the world.
  • If you're using someone else's library code (and of course you are), assume its Unicode handling is broken until proved to be correct.
  • If you're doing search, try to hand the linguistic and character-handling problems off to someone who understands them.
  • Go off to Amazon or somewhere and buy the latest revision of the printed Unicode standard; it contains pretty well everything you need to know.
  • Spend some time poking around the Unicode web site and learning how the code charts work.
  • If you're going to have to do any serious work with Asian languages, go buy the O'Reilly book on the subject by Ken Lunde.
  • If you have a Macintosh, run out and grab Lord Pixel's Unicode Font Inspection tool. Totally cool.
  • If you're really going to have to get down and dirty with the data, go attend one of the twice-a-year Unicode conferences. All the experts go and if you don't know what you need to know, you'll be able to find someone there who knows.
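
The "pick one of the two and stick with it" tip can be sketched roughly as follows. This is a minimal Python illustration, not something from the quoted article: the helper names `read_text` and `write_text` are hypothetical, and UTF-8 is assumed as the single chosen storage encoding. The idea is to decode bytes once at the boundary, work only with text internally, and encode with the one chosen encoding on the way out.

```python
# Hypothetical helpers illustrating "decode at the edge, one encoding inside".

STORAGE_ENCODING = "utf-8"  # assumption: the one encoding the team agreed on


def read_text(raw: bytes, source_encoding: str) -> str:
    """Decode incoming bytes exactly once, using the encoding of the source."""
    return raw.decode(source_encoding)


def write_text(text: str) -> bytes:
    """Encode outgoing text with the single agreed storage encoding."""
    return text.encode(STORAGE_ENCODING)


# Pretend this payload arrived from a system that speaks UTF-16.
incoming = "Grüße, 世界".encode("utf-16")

text = read_text(incoming, "utf-16")   # bytes -> text at the boundary
stored = write_text(text)              # text -> UTF-8 bytes for storage

# Round-tripping through the storage encoding preserves the text.
assert stored.decode(STORAGE_ENCODING) == text
```

Everything between `read_text` and `write_text` deals with `str` only, so the question "which encoding is this?" has to be answered in just two places.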
fijter answered Oct 12 '22 23:10


I spent a while working with search engine software. You wouldn't believe how many websites serve up content with HTTP headers or meta tags that lie about the encoding of the pages. Often, you'll even get a document that contains both ISO-8859 characters and UTF-8 characters.

Once you've battled through a few of those sorts of issues, you start taking the proper character encoding of data you produce really seriously.
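
To give a feel for what consuming such pages involves, here is a minimal Python sketch of defensive decoding. It is only an illustration under stated assumptions, not the approach any particular search engine uses: the function name `decode_page` is hypothetical, and the fallback order (declared encoding, then strict UTF-8, then ISO-8859-1) is an assumption chosen for the example.

```python
from typing import Optional


def decode_page(raw: bytes, declared: Optional[str] = None) -> str:
    """Decode a page without trusting the declared encoding blindly.

    Hypothetical helper: tries the declared encoding, then strict UTF-8,
    and finally ISO-8859-1, which maps every byte to a character and so
    cannot fail (though the result may be wrong for other encodings).
    """
    candidates = [declared] if declared else []
    candidates.append("utf-8")
    for encoding in candidates:
        try:
            return raw.decode(encoding)  # strict: raises on invalid bytes
        except (UnicodeDecodeError, LookupError):
            # Header lied about the bytes, or named an unknown encoding.
            continue
    return raw.decode("iso-8859-1")


# A page whose header claims UTF-8 but whose bytes are ISO-8859-1.
latin1_page = "café".encode("iso-8859-1")
recovered = decode_page(latin1_page, declared="utf-8")
```

A sketch like this only picks one encoding for the whole document. It does nothing for a page that genuinely mixes ISO-8859 and UTF-8 byte sequences, which is part of why getting the encoding right when you produce the data matters so much.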

Matt Sheppard answered Oct 12 '22 23:10