
How do I do URL encoding of ASCII characters?

I'm using Ruby to extract the URL of a file and then download it. The file name contains non-ASCII characters encoded as UTF-8, e.g.:

www.domain.com/.../ÖÇÄÜ360ÓïÒôÖúÀí.txt

Downloading the above URL fails. Escaping it with URI::escape produces a URI that also doesn't work:

www.domain.com/.../%C3%96%C3%87%C3%84%C3%9C360%C3%93%C3%AF%C3%92%C3%B4%C3%96%C3%BA%C3%80%C3%AD.txt

But if I follow the URL Encoding Reference, it works:

www.domain.com/.../%D6%C7%C4%DC360%D3%EF%D2%F4%D6%FA%C0%ED.txt

I searched for a Ruby function that does exactly this encoding, but couldn't find one. Before I write a function that implements the table in the link above, I want to ask whether anyone knows of an existing library that does this. And if I do end up writing it myself, which range of characters should I encode? Obviously not everything.

I'm using JRuby 1.6.2 with RUBY_VERSION => "1.8.7"

Asked by Rami on May 10 '12 at 20:05


1 Answer

Oh, the joys of character encodings!

What’s happening here is as follows. Ruby internally stores the string you have extracted as a sequence of bytes: the UTF-8 encoding of the file name. When you call URI.escape on it, each non-ASCII byte is escaped in %xy format, and the resulting string, which now consists solely of bytes in the ASCII range, is used as the URL.

The receiving server, however, interprets those bytes (after unescaping them from %xy form) as if they were in a different encoding, in this case ISO-8859-1, so the file name it comes up with doesn’t match anything it has.

Here’s a demonstration using Ruby 1.9, as it has better support for encodings.

1.9.3-p194 :003 > f
 => "ÖÇÄÜ360ÓïÒôÖúÀí.txt" 
1.9.3-p194 :004 > f.encoding
 => #<Encoding:UTF-8> 
1.9.3-p194 :005 > URI.escape f
 => "%C3%96%C3%87%C3%84%C3%9C360%C3%93%C3%AF%C3%92%C3%B4%C3%96%C3%BA%C3%80%C3%AD.txt" 
1.9.3-p194 :006 > g = f.encode 'iso-8859-1'
 => "\xD6\xC7\xC4\xDC360\xD3\xEF\xD2\xF4\xD6\xFA\xC0\xED.txt" 
1.9.3-p194 :007 > g.encoding
 => #<Encoding:ISO-8859-1> 
1.9.3-p194 :008 > URI.escape g
 => "%D6%C7%C4%DC360%D3%EF%D2%F4%D6%FA%C0%ED.txt"
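The mismatch can also be looked at from the server's side. What follows is only a minimal sketch for Ruby 1.9 or later: `as_seen_by` is a hypothetical helper name, not a library method, and ISO-8859-1 as the server's encoding is the assumption made in this answer.

```ruby
# Hypothetical helper: show how a server that assumes `server_encoding`
# would read the UTF-8 bytes of `name` once it has unescaped them.
def as_seen_by(name, server_encoding = 'ISO-8859-1')
  raw = name.encode('UTF-8').dup         # the bytes that get percent-escaped
  raw.force_encoding(server_encoding)    # relabel the same bytes; nothing is converted
  raw.encode('UTF-8')                    # convert back to UTF-8 so the two can be compared
end

name = "\u00D6\u00C7.txt"                # "ÖÇ.txt" written with escapes
seen = as_seen_by(name)
puts seen == name                        # false: the server looks for a different name
puts seen.length                         # 8 characters, although the real name has 6
```

Each two-byte UTF-8 sequence turns into two separate ISO-8859-1 characters, which is why the lookup on the server fails.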

The solution in this case is therefore to encode the string as ISO-8859-1 before escaping it. In Ruby 1.9 you do this as above. In earlier versions you can use Iconv (I’m assuming JRuby includes Iconv; I’m not that familiar with JRuby):

1.8.7 :001 > f
 => "\303\226\303\207\303\204\303\234360\303\223\303\257\303\222\303\264\303\226\303\272\303\200\303\255.txt" 
1.8.7 :005 > g = Iconv.conv('iso-8859-1', 'utf-8', f)
 => "\326\307\304\334360\323\357\322\364\326\372\300\355.txt" 
1.8.7 :006 > URI.escape f
 => "%C3%96%C3%87%C3%84%C3%9C360%C3%93%C3%AF%C3%92%C3%B4%C3%96%C3%BA%C3%80%C3%AD.txt" 
1.8.7 :007 > URI.escape g
 => "%D6%C7%C4%DC360%D3%EF%D2%F4%D6%FA%C0%ED.txt" 
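URI.escape was removed in Ruby 3.0, so on a current Ruby the escaping step needs a replacement. The following is a rough sketch of the same two steps (transcode, then percent-escape the bytes), not a drop-in equivalent of URI.escape: `unreserved_byte?` and `escape_for_server` are hypothetical helper names, and the set of characters left unescaped is the "unreserved" set from RFC 3986, which is narrower than what URI.escape leaves alone.

```ruby
# Hypothetical helper: true for the bytes RFC 3986 calls "unreserved"
# (A-Z, a-z, 0-9, "-", ".", "_", "~"), which never need escaping.
def unreserved_byte?(byte)
  case byte
  when 0x41..0x5A, 0x61..0x7A, 0x30..0x39, 0x2D, 0x2E, 0x5F, 0x7E then true
  else false
  end
end

# Hypothetical helper: transcode `name` to the encoding the server is assumed
# to use, then percent-escape every byte outside the unreserved set.
# Raises Encoding::UndefinedConversionError if a character cannot be
# represented in `server_encoding`.
def escape_for_server(name, server_encoding = 'ISO-8859-1')
  name.encode(server_encoding).each_byte.map do |byte|
    unreserved_byte?(byte) ? byte.chr : format('%%%02X', byte)
  end.join
end

name = "\u00D6\u00C7\u00C4\u00DC360.txt"   # "ÖÇÄÜ360.txt" written with escapes
puts escape_for_server(name)                # %D6%C7%C4%DC360.txt
puts escape_for_server(name, 'UTF-8')       # %C3%96%C3%87%C3%84%C3%9C360.txt
```

Because this sketch also escapes reserved characters such as "/", it should be applied to the file name only, not to the whole URL.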

Note that in general you can’t depend on the server using any particular encoding. It should be using UTF-8, but obviously isn’t in this case.

Answered by matt on Oct 13 '22 at 14:10