I'm using Ruby to extract the URL of a file and then download it. The file name contains non-ASCII (UTF-8) characters, for example:
www.domain.com/.../ÖÇÄÜ360ÓïÒôÖúÀí.txt
Trying to download the above URL fails. Using URI::escape produces a URI that also doesn't work:
www.domain.com/.../%C3%96%C3%87%C3%84%C3%9C360%C3%93%C3%AF%C3%92%C3%B4%C3%96%C3%BA%C3%80%C3%AD.txt
But if I follow the URL Encoding Reference, it works:
www.domain.com/.../%D6%C7%C4%DC360%D3%EF%D2%F4%D6%FA%C0%ED.txt
I searched for a Ruby function that does exactly this encoding, but couldn't find one. Before I write a function that implements the table in the link above, I want to ask whether anyone knows of an existing library that does this. And if I do write it myself, which range of characters should I encode? Obviously not everything.
I'm using JRuby 1.6.2 with RUBY_VERSION => "1.8.7"
Oh, the joys of character encodings!
What’s happening here is as follows. Ruby stores the string you have extracted as a sequence of bytes, namely the UTF-8 encoding of the file name. When you call URI.escape on it, those bytes are escaped in %xy format, and the resulting string, which now consists solely of bytes in the ASCII range, is used as the URL.
The receiving server, however, interprets those bytes (after unescaping them from %xy form) as if they were in a different encoding, in this case ISO-8859-1, and so the filename it comes up with doesn’t match anything it has.
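To make the mismatch concrete, here is a small sketch for a modern Ruby (1.9 or later) that mimics what the server ends up with when it reads the UTF-8 bytes as ISO-8859-1. The variable names are mine, and the server's behaviour is inferred from the URLs in the question rather than known for certain.

```ruby
f = "ÖÇÄÜ360ÓïÒôÖúÀí.txt"   # UTF-8 string, as extracted from the page

# Reinterpret the very same bytes as ISO-8859-1 (assumed server behaviour),
# then convert back to UTF-8 so the two strings can be compared.
misread = f.dup.force_encoding('ISO-8859-1').encode('UTF-8')

puts f.length                          # 19 characters
puts f.bytesize                        # 31 bytes: each accented letter takes 2 bytes in UTF-8
puts misread.length                    # 31 characters: every byte became its own character
puts f.encode('ISO-8859-1').bytesize   # 19 bytes: the byte sequence the server expects
puts misread == f                      # false, so the lookup on the server fails
```

The name the server reconstructs is 31 characters long instead of 19, which is why it matches nothing on disk.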
Here’s a demonstration using Ruby 1.9, as it has better support for encodings.
1.9.3-p194 :003 > f
=> "ÖÇÄÜ360ÓïÒôÖúÀí.txt"
1.9.3-p194 :004 > f.encoding
=> #<Encoding:UTF-8>
1.9.3-p194 :005 > URI.escape f
=> "%C3%96%C3%87%C3%84%C3%9C360%C3%93%C3%AF%C3%92%C3%B4%C3%96%C3%BA%C3%80%C3%AD.txt"
1.9.3-p194 :006 > g = f.encode 'iso-8859-1'
=> "\xD6\xC7\xC4\xDC360\xD3\xEF\xD2\xF4\xD6\xFA\xC0\xED.txt"
1.9.3-p194 :007 > g.encoding
=> #<Encoding:ISO-8859-1>
1.9.3-p194 :008 > URI.escape g
=> "%D6%C7%C4%DC360%D3%EF%D2%F4%D6%FA%C0%ED.txt"
The solution in this case is therefore to encode the string as ISO-8859-1 before escaping it. In Ruby 1.9 you do this as above; in earlier versions you can use Iconv (I’m assuming JRuby includes Iconv, but I’m not that familiar with JRuby):
1.8.7 :001 > f
=> "\303\226\303\207\303\204\303\234360\303\223\303\257\303\222\303\264\303\226\303\272\303\200\303\255.txt"
1.8.7 :005 > g = Iconv.conv('iso-8859-1', 'utf-8', f)
=> "\326\307\304\334360\323\357\322\364\326\372\300\355.txt"
1.8.7 :006 > URI.escape f
=> "%C3%96%C3%87%C3%84%C3%9C360%C3%93%C3%AF%C3%92%C3%B4%C3%96%C3%BA%C3%80%C3%AD.txt"
1.8.7 :007 > URI.escape g
=> "%D6%C7%C4%DC360%D3%EF%D2%F4%D6%FA%C0%ED.txt"
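For anyone reading this on a current Ruby: URI.escape was removed in Ruby 3.0, so the sessions above won't run there as written. The following is a minimal sketch of the same idea with the escaping done by hand. `escape_in_encoding` and `UNRESERVED` are hypothetical names of my own, and I escape every byte outside the RFC 3986 unreserved set, which is stricter than URI.escape was.

```ruby
# Bytes that never need escaping in a URL path segment (RFC 3986 "unreserved").
UNRESERVED = [*('A'..'Z'), *('a'..'z'), *('0'..'9'), '-', '.', '_', '~'].join.bytes.freeze

# Hypothetical helper: transcode the name to the target encoding, then
# percent-escape every byte that is not in the unreserved set.
def escape_in_encoding(name, encoding = 'ISO-8859-1')
  name.encode(encoding).bytes.map { |b|
    UNRESERVED.include?(b) ? b.chr : format('%%%02X', b)
  }.join
end

f = "ÖÇÄÜ360ÓïÒôÖúÀí.txt"

puts escape_in_encoding(f, 'UTF-8')
# %C3%96%C3%87%C3%84%C3%9C360%C3%93%C3%AF%C3%92%C3%B4%C3%96%C3%BA%C3%80%C3%AD.txt

puts escape_in_encoding(f)
# %D6%C7%C4%DC360%D3%EF%D2%F4%D6%FA%C0%ED.txt
```

Only apply this to the file name (or to one path segment at a time). Running it over a whole URL would also escape the slashes and the colon.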
Note that in general you can’t depend on the server using any particular encoding. It should be using UTF-8, but evidently it isn’t in this case.
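Since the server's encoding can't be known in advance, one pragmatic option is to build the escaped name in several encodings and try each in turn until a download succeeds. This is only a sketch: `escape_bytes` and `candidate_escapes` are hypothetical helpers, the list of encodings is a guess you would adjust for your servers, and the actual HTTP request is left out.

```ruby
# Hypothetical helper: percent-escape every byte outside the RFC 3986 unreserved set.
def escape_bytes(str)
  str.bytes.map { |b|
    (b < 128 && b.chr.match?(/[A-Za-z0-9._~-]/)) ? b.chr : format('%%%02X', b)
  }.join
end

# Hypothetical helper: return the escaped name once per candidate encoding,
# skipping encodings that cannot represent the name at all.
def candidate_escapes(name, encodings = %w[UTF-8 ISO-8859-1])
  encodings.map { |enc|
    begin
      escape_bytes(name.encode(enc))
    rescue EncodingError
      nil # e.g. a Japanese name cannot be written in ISO-8859-1
    end
  }.compact.uniq
end

candidate_escapes("ÖÇÄÜ360ÓïÒôÖúÀí.txt").each { |escaped| puts escaped }
```

Trying UTF-8 first keeps well-behaved servers on the fast path, and the ISO-8859-1 form is only requested when that fails.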