Determine character encoding in Ruby 1.9.3

My Rails 3.2.2 / Ruby 1.9.3 application gets search requests such as:

http://booko.com.au/books/search?q=Fran%E7ois+Vergniolle+de+Chantal

Ruby/Rails takes this query and unescapes it, but assumes the result is UTF-8. At some point I get:

invalid byte sequence in UTF-8
app/models/product.rb:694:in `upcase' 

I think it's doing something like this:

q="Fran%E7ois+Vergniolle+de+Chantal"
=> "Fran%E7ois+Vergniolle+de+Chantal"

CGI.unescape( q )
=> "Fran\xE7ois Vergniolle de Chantal"

CGI.unescape( q ).encoding.name
=> "UTF-8"

CGI.unescape( q ).valid_encoding?
=> false

What is the correct way of dealing with this? I'd like to transcode it to the correct encoding, but how do I determine the current encoding? What I'm currently doing is just assuming it's Latin-1 (ISO-8859-1):

q.encode!("ISO-8859-1", "UTF-8", :invalid => :replace, :undef => :replace, :replace => "")

Or doing something I found on a blog somewhere:

q = q.unpack('C*').pack('U*')

What's the right way of dealing with this?

Edit: The server is correctly sending a "Content-Type: text/html; charset=utf-8" header to the client. The page also contains the appropriate meta tag: 'meta http-equiv="content-type" content="text/html;charset=UTF-8"'

I'm not sure whether there's another way to tell the client which encoding to use.

dkam asked Mar 21 '12 06:03


1 Answer

The character ç is encoded in the URL as %E7. This is how ISO-8859-1 encodes ç. The ISO-8859-1 character set represents a character with a single byte. The byte which represents ç can be expressed in hex as E7.

In Unicode, ç has a code point of U+00E7. Unlike ISO-8859-1, in which the code point (E7) is the same as its encoding (E7 in hex), Unicode has multiple encoding schemes, such as UTF-8, UTF-16 and UTF-32. UTF-8 encodes U+00E7 (ç) as two bytes: C3 A7.

See here for other ways to encode ç.

As to why U+00E7 and E7 in ISO-8859-1 both use "E7", the first 256 code points in Unicode were made identical to ISO-8859-1.
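
To make the byte-level difference concrete, here is a small sketch (the variable names are mine, not from the question). It prints how ç is stored in each encoding, and it also suggests why the unpack('C*').pack('U*') trick from the question happens to work: each ISO-8859-1 byte value is treated as a Unicode code point of the same number.

```ruby
c_cedilla = "\u00E7"   # "ç" as a UTF-8 string

utf8_bytes   = c_cedilla.bytes.to_a                       # bytes when stored as UTF-8
latin1_bytes = c_cedilla.encode("ISO-8859-1").bytes.to_a  # bytes when stored as ISO-8859-1
code_points  = c_cedilla.codepoints.to_a                  # Unicode code point(s)

puts utf8_bytes.map   { |b| "%02X" % b }.join(" ")   # C3 A7
puts latin1_bytes.map { |b| "%02X" % b }.join(" ")   # E7
puts code_points.map  { |c| "U+%04X" % c }.join(" ") # U+00E7

# The single ISO-8859-1 byte E7, built as a raw binary string.
latin1_raw = [0xE7].pack("C")

# Treat each byte value as a code point and emit it as UTF-8 ...
via_pack = latin1_raw.unpack("C*").pack("U*")
# ... which gives the same result as transcoding ISO-8859-1 -> UTF-8.
via_encode = latin1_raw.encode("UTF-8", "ISO-8859-1")

puts via_pack == via_encode   # true
```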

If this URL were UTF-8, ç would be encoded as %C3%A7. My (very limited) understanding of RFC 2616 is that ISO-8859-1 is the default charset HTTP assumes for text when none is specified; as far as I can tell, nothing forces a client to percent-encode a query string as UTF-8. Since %E7 on its own is not valid UTF-8 but is exactly ç in ISO-8859-1, this is most likely an ISO-8859-1-encoded URL. The best approach is therefore probably to check whether the encoding is valid and, if not, assume it is ISO-8859-1 and transcode it to UTF-8:

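# Only touch strings that are not already valid UTF-8.
unless query.valid_encoding?
  # Arguments are (destination, source): read the bytes as ISO-8859-1, write UTF-8.
  query.encode!("UTF-8", "ISO-8859-1", :invalid => :replace, :undef => :replace, :replace => "")
end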

Here's the process in IRB (plus re-escaping the result at the end, for fun):

a = CGI.unescape("%E7")
=> "\xE7"
a.encoding
=> #<Encoding:UTF-8>
a.valid_encoding?
=> false
b = a.encode("UTF-8", "ISO-8859-1")    # From ISO-8859-1 -> UTF-8
=> "ç"
b.encoding
=> #<Encoding:UTF-8>
CGI.escape(b)
=> "%C3%A7"
dkam answered Nov 09 '22 07:11