Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Non-ascii characters in URL

I ran into a new problem that I've never seen before: My client is adding files to a project we built and some of the filenames have special characters in them because some of the words are spanish.

For example a file I'm testing has an á in it. I am calling that image in a css file as a background image but in Safari it doesnt show up. But it does on FF and Chrome.

As a test I pasted the link into the browser and the same thing. Works on FF and Chrome but Safari throws an error. So the language characters are throwing it I guess?

Firefox converts the following url and changes the á to a%CC%81 and loads the image.

http://www.themediacouncil.com/test/nonascii/LA-MAR_Cebiche-Clássico_foto-Henrique-Peron-470x120-1371827671.jpg

You can see it breaks above... but FF and Chrome convert that to: http://www.themediacouncil.com/test/nonascii/LA-MAR_Cebiche-Cla%CC%81ssico_foto-Henrique-Peron-470x120-1371827671.jpg

You can also see this in action here: http://jsfiddle.net/Md4gZ/2/

.testbox { width:340px; height:100px; background:url('http://www.themediacouncil.com/test/nonascii/LA-MAR_Cebiche-Clássico_foto-Henrique-Peron-470x120-1371827671.jpg') no-repeat top left; }

So whats the right way to handle this. I'm developing in PHP and WORDPRESS. I'd rather not have to tell the client to go back and replace all files with special characters.

Any help is appreciated. Thanks!

like image 836
dave Avatar asked Jun 21 '13 19:06

dave


People also ask

What characters are non-ASCII?

Overview. Non-ASCII characters are those that are not encoded in ASCII, such as Unicode, EBCDIC, etc. ASCII is limited to 128 characters and was initially developed for the English language. In this tutorial, we'll look at some tools to find and highlight non-ASCII characters within text files.

Do URLs have to be ASCII?

URLs can only be sent over the Internet using the ASCII character-set. Since URLs often contain characters outside the ASCII set, the URL has to be converted into a valid ASCII format. URL encoding replaces unsafe ASCII characters with a "%" followed by two hexadecimal digits.

What characters are allowed in a URL?

There are only certain characters that are allowed in the URL string, alphabetic characters, numerals, and a few characters ; , / ? : @ & = + $ - _ . ! ~ * ' ( ) # that can have special meanings.

What characters should not be used in a URL?

These characters are { , } , | , \ , ^ , ~ , [ , ] , and ` . All unsafe characters must always be encoded within a URL.


1 Answers

I believe what is becoming the standard is to convert non-ascii characters to UTF-8 byte sequences, and include those sequences as %HH hex codes in the URL. The á character is U+00E1 (Unicode), which in UTF-8 makes the two bytes 0xC3 0xA1. Hence, Clássico would become Cl%C3%A1ssico.

The conversion you report from Firefox, Cla%CC%81ssico, did this slightly differently: it changed the á into a followed by U+0301, the COMBINING ACUTE ACCENT character. In UTF-8, U+0301 makes 0xCC 0x81.

Which representation you should choose – unicode “á” or “a followed by combining accent” – depends on what the web server needs for matching the right thing. In your case, maybe the filename actually contains the combining-character accent, and that's why it worked (hard to tell).

Another, older, way to handle non-ascii latin characters is to use an 8-bit latin charset representation (ISO-8859-1 or something similar, such as Windows-1252) and encode that as one byte. That would make Clássico into Cl%E1ssico. But since this only works for latin charsets, and is ambiguous for some of their characters, it is hopefully and probably disappearing.

like image 71
njlarsson Avatar answered Oct 22 '22 18:10

njlarsson