
Rules for slugs and Unicode

After researching how different people slugify titles, I've noticed that most approaches don't say how to deal with non-English titles.

URL encoding is very restrictive; see http://www.blooberry.com/indexdot/html/topics/urlencoding.htm

So, for example, how do folks build title slugs for things like

"Una lágrima cayó en la arena"

One can come up with a reasonable table for Indo-European languages, i.e. text that can be encoded in ISO-8859-1. For example, a conversion table would translate 'á' => 'a', so the slug would be

"una-lagrima-cayo-en-la-arena"
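A minimal sketch of that table-based approach, assuming Python 3. The table `ACCENT_TABLE` and the function name `slugify_with_table` are hypothetical, and the table is deliberately tiny; a real one would need to cover far more characters.

```python
# Hypothetical, deliberately small conversion table (not a complete one).
ACCENT_TABLE = str.maketrans({
    "á": "a", "é": "e", "í": "i", "ó": "o", "ú": "u",
    "ü": "u", "ñ": "n",
})


def slugify_with_table(title):
    """Lowercase the title, map accented letters via the table,
    and join the whitespace-separated words with hyphens."""
    lowered = title.lower().translate(ACCENT_TABLE)
    return "-".join(lowered.split())


print(slugify_with_table("Una lágrima cayó en la arena"))
# prints: una-lagrima-cayo-en-la-arena
```

Characters missing from the table pass through unchanged here, which is exactly the gap the rest of the question is about.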

However, I'm using Unicode (specifically the UTF-8 encoding), so there are no guarantees about what sort of code points I'm going to get. I have to be prepared for characters that can't be encoded in ISO-8859-1.

In a nutshell: how do I deal with this? Should I come up with a conversion table for characters in the ISO-8859-1 range (code points up to 255) and drop everything else?

EDIT: To give a bit more context: a priori, I don't really expect to slugify data in non-Indo-European languages, but I'd like to have a plan in case I encounter such data. A conversion table for extended ASCII would be nice. Any pointers?

Also, since people are asking: I'm using Python, running on Google App Engine.

bustrofedon asked May 04 '09 15:05


3 Answers

A nearly complete transliteration table (for the Latin, Greek and Cyrillic character sets) can be found in the slughifi library. It is geared towards Django, but can easily be modified to fit general needs (I use it with a Werkzeug-based app on App Engine).

zgoda answered Nov 18 '22 02:11


I simply use UTF-8 for URL paths. As long as the domain is non-IDN, FF3 and IE work fine with this. Google reads and displays them correctly. The IRI RFC (RFC 3987) allows Unicode. Just make sure you parse the incoming URLs correctly.
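To illustrate the "parse the incoming URLs correctly" part, here is a rough sketch, assuming Python 3 and its standard `urllib.parse` module. The helper names `path_segment_for` and `title_slug_from` are made up for this example. On the wire, a Unicode path segment travels as percent-encoded UTF-8, so the server has to decode it before looking up the slug.

```python
from urllib.parse import quote, unquote


def path_segment_for(title):
    """Build a slug that keeps non-ASCII letters, then percent-encode
    it as UTF-8 so it is safe to place in a URL path."""
    slug = "-".join(title.lower().split())
    return quote(slug, safe="-")


def title_slug_from(segment):
    """Decode an incoming percent-encoded path segment back to text."""
    return unquote(segment)


encoded = path_segment_for("Una lágrima")
print(encoded)                   # prints: una-l%C3%A1grima
print(title_slug_from(encoded))  # prints: una-lágrima
```

Browsers that display IRIs show the decoded form to the user, while the request itself carries the percent-encoded bytes.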

felixg answered Nov 18 '22 03:11


In general this is going to depend on the language you expect to get. If your primary userbase is Japanese, dropping everything but ISO-8859-1 characters is unlikely to go over well.

That said, one option might be to use transliteration mode, if your character set conversion library supports it. For example, with GNU iconv, one can do:

] echo Una lágrima cayó en la arena|iconv -f utf8 -t ascii//TRANSLIT
Una lagrima cayo en la arena

As you can see, the accented characters were automatically converted to something in the ASCII range. How to translate this to code will of course depend on the language you're using, but if your language's charset conversion is based on GNU iconv (and if it's on Linux, it probably is), this trick can probably be applied directly by simply specifying "ascii//TRANSLIT" as the convert-to character set.
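Since the asker is on Python, a comparable result for accented Latin letters can be sketched with the standard `unicodedata` module. Note that this is a different technique from iconv's `//TRANSLIT` (Unicode decomposition followed by dropping non-ASCII code points), and the function name `strip_accents` is just an illustrative choice. Python 3 is assumed.

```python
import unicodedata


def strip_accents(text):
    """Decompose characters (NFKD) so that 'á' becomes 'a' plus a
    combining accent, then drop everything that is not ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(strip_accents("Una lágrima cayó en la arena"))
# prints: Una lagrima cayo en la arena
```

Unlike iconv, this sketch silently drops characters it cannot map instead of emitting "?".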

One thing to note with this, however, is that it's only effective with characters that "look like" something in ASCII. For example:

] echo 我輩は猫である。名前はまだない。|iconv -f utf8 -t ascii//TRANSLIT
????????????????

As you can see, it's not much help for Japanese, and needs further processing afterward to remove characters not suitable for URLs.
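The further processing mentioned above might look something like the following sketch, assuming Python 3. The function name `finish_slug` and the `"untitled"` fallback are assumptions for illustration; an application might prefer a numeric ID or the raw UTF-8 title when nothing survives transliteration.

```python
import re


def finish_slug(transliterated, fallback="untitled"):
    """Turn transliterated text into a URL slug: lowercase it, collapse
    every run of characters other than a-z and 0-9 into one hyphen,
    and trim hyphens from both ends. If nothing is left (for example,
    the input was all '?'), return the fallback instead."""
    slug = re.sub(r"[^a-z0-9]+", "-", transliterated.lower()).strip("-")
    return slug or fallback


print(finish_slug("Una lagrima cayo en la arena"))
# prints: una-lagrima-cayo-en-la-arena
print(finish_slug("????????????????"))
# prints: untitled
```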

bdonlan answered Nov 18 '22 02:11