My Rails 3.2.2 / Ruby 1.9.3 application gets search requests such as:
http://booko.com.au/books/search?q=Fran%E7ois+Vergniolle+de+Chantal
Ruby / Rails takes this query and decodes it - but assumes it's UTF-8. At some point I get a :
invalid byte sequence in UTF-8
app/models/product.rb:694:in `upcase' 
I think it's doing something like this:
q="Fran%E7ois+Vergniolle+de+Chantal"
=> "Fran%E7ois+Vergniolle+de+Chantal"
CGI.unescape( q )
=> "Fran\xE7ois Vergniolle de Chantal"
CGI.unescape( q ).encoding.name
=> "UTF-8"
CGI.unescape( q ).valid_encoding?
=> false
What is the correct way of dealing with this? I'd like to transcode it to the correct encoding - but how do I determine the current encoding? What I'm currently doing, is just assuming it's LATIN1:
q.encode!("ISO-8859-1", "UTF-8", :invalid => :replace, :undef => :replace, :replace => "")
Or doing something I found on a blog somewhere:
q = q.unpack('C*').pack('U*')
What's the right way of dealing with this?
Edit The server is correctly sending "Content-Type: text/html; charset=utf-8" header to the client. The page also contains the appropriate meta tag: 'meta http-equiv="content-type" content="text/html;charset=UTF-8"'
Not sure if there's another method to tell the client which encodings to use?
The character ç is encoded in the URL as %E7. This is how ISO-8859-1 encodes ç. The ISO-8859-1 character set represents a character with a single byte. The byte which represents ç can be expressed in hex as E7.
In Unicode, ç has a code point of U+00E7. Unlike ISO-8859-1, in which the code point (E7) is the same as it's encoding (E7 in hex), Unicode has multiple encoding schemes such as UTF-8, UTF-16 and UTF-32. UTF-8 encodes U+00E7 (ç) as two bytes - C3 A7.
See here for other ways to encode ç.
As to why U+00E7 and E7 in ISO-8859-1 both use "E7", the first 256 code points in Unicode were made identical to ISO-8859-1.
If this URL were UTF-8, ç would be encoded as %C3%A7. My (very limited) understanding of RFC2616 is that the default encoding for a URL is (currently) ISO-8859-1. Therefore, this is most likely ISO-8859-1 encoded URL. Which means, the best approach is probably to check that the encoding is valid and if not, assume it is ISO-8859-1 and transcode it to UTF-8:
unless query.valid_encoding?
    query.encode!("UTF-8", "ISO-8859-1", :invalid => :replace, :undef => :replace, :replace => "")
end
Here's the process in IRB (plus an escaping at the end for fun)
a = CGI.unescape("%E7")
=> "\xE7"
a.encoding
=> #<Encoding:UTF-8>
a.valid_encoding?
=> false
b = a.encode("UTF-8", "ISO-8859-1")    # From ISO-8859-1 -> UTF-8
=> "ç"
b.encoding
=> #<Encoding:UTF-8>
CGI.escape(b)
=> "%C3%A7"
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With