
Rails, Heroku and invalid byte sequence in UTF-8 error

I have a queue of text messages in Redis. Let's say one of the messages in Redis looks like this:

"niño" 

(note the non-ASCII character).

The Rails app displays the queue of messages. When I test locally (Rails 3.2.2, Ruby 1.9.3) everything is fine, but on Heroku Cedar (Rails 3.2.2, and I believe Ruby 1.9.2) I get the infamous error: ActionView::Template::Error (invalid byte sequence in UTF-8)

After reading and rereading all I could find online I am still stuck as to how to fix this.

Any help or a pointer in the right direction is greatly appreciated!

edit:

I managed to find a solution. I ended up using Iconv:

string = Iconv.iconv('UTF-8', 'ISO-8859-1', message)[0]

None of the suggested answers I found elsewhere seemed to work in my case.

asked Apr 06 '12 16:04 by klaut


1 Answer

On Heroku, when your app receives the message "niño" from Redis, it is actually getting the four bytes:

 0x6e 0x69 0xf1 0x6f

which, when interpreted as ISO-8859-1, correspond to the characters n, i, ñ and o.

However, your Rails app assumes that these bytes should be interpreted as UTF-8, and at some point it tries to decode them this way. The third byte in this sequence, 0xf1, looks like this in binary:

1 1 1 1 0 0 0 1

If you compare this to the table on the Wikipedia page, you can see this byte is the leading byte of a four-byte character (it matches the pattern 11110xxx), and as such should be followed by three continuation bytes that all match the pattern 10xxxxxx. It isn't: the next byte is 0x6f (01101111), so this is an invalid UTF-8 byte sequence and you get the error you see.
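The bit patterns described above can be checked in plain Ruby. This is a minimal sketch, assuming Ruby 1.9 or later; the variable names are illustrative and do not come from the question's app.

```ruby
# The four bytes the app receives on Heroku, built from integers so the
# encoding of this source file does not matter.
raw = [0x6e, 0x69, 0xf1, 0x6f].pack('C*')   # binary (ASCII-8BIT) string

lead_bits = format('%08b', 0xf1)   # "11110001"
next_bits = format('%08b', 0x6f)   # "01101111"

# 0xf1 matches 11110xxx, so UTF-8 expects three continuation bytes after it.
is_four_byte_lead = (0xf1 & 0b11111000) == 0b11110000
# 0x6f does not match 10xxxxxx, so it cannot be a continuation byte.
is_continuation = (0x6f & 0b11000000) == 0b10000000

# Label the bytes as UTF-8 (no conversion) and ask Ruby whether they are valid.
valid = raw.dup.force_encoding('UTF-8').valid_encoding?

puts lead_bits, next_bits, is_four_byte_lead, is_continuation, valid
```

Here `valid` comes out false, which is the same condition that makes the template rendering fail.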

Using:

string = message.encode('utf-8', 'iso-8859-1')

(or the Iconv equivalent) tells Ruby to read message as ISO-8859-1 encoded, and then to create the equivalent string in UTF-8 encoding, which you can then use without problems. An alternative could be to use force_encoding to tell Ruby the correct encoding of the string without converting it, but that will likely cause problems later when you try to mix UTF-8 and ISO-8859-1 strings.
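To make the difference between the two approaches concrete, here is a small sketch, assuming Ruby 1.9 or later. The `message` variable stands in for the bytes read from Redis; it is built by hand here rather than fetched from a real queue.

```ruby
# Stand-in for the four ISO-8859-1 bytes read from Redis on Heroku.
message = [0x6e, 0x69, 0xf1, 0x6f].pack('C*')

# encode transcodes: read the bytes as ISO-8859-1, build a new UTF-8 string.
converted = message.encode('UTF-8', 'ISO-8859-1')
converted_bytes = converted.bytes.to_a      # five bytes, 0xf1 became 0xc3 0xb1
converted_ok = converted.valid_encoding?

# force_encoding only changes the label; the four bytes are untouched.
relabelled = message.dup.force_encoding('ISO-8859-1')
relabelled_bytes = relabelled.bytes.to_a

# Concatenating the relabelled string with a non-ASCII UTF-8 string raises.
mix_error = begin
  "\u00f1" + relabelled
  nil
rescue Encoding::CompatibilityError => e
  e.class
end

puts converted_bytes.inspect, converted_ok, relabelled_bytes.inspect, mix_error.inspect
```

The rescued error is the kind of later problem that relabelling without converting tends to cause.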

In UTF-8, the string "niño" corresponds to the bytes:

0x6e 0x69 0xc3 0xb1 0x6f

Note that the first, second and last bytes are the same as before. The ñ character is encoded as the two bytes 0xc3 0xb1. If you write these out in binary and compare them to the table in the Wikipedia article again, you'll see they encode the code point 0xf1, which is also the ISO-8859-1 encoding of ñ (since the first 256 Unicode code points match ISO-8859-1).
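The binary comparison can be done by hand or in a few lines of Ruby. This is a sketch of the two-byte decoding rule only (lead byte 110xxxxx, continuation byte 10xxxxxx), with illustrative variable names; it is not a general UTF-8 decoder.

```ruby
lead, cont = 0xc3, 0xb1

lead_bits = format('%08b', lead)   # "11000011", matches 110xxxxx
cont_bits = format('%08b', cont)   # "10110001", matches 10xxxxxx

# Keep only the x bits: the low 5 bits of the lead byte and the low 6 bits
# of the continuation byte, then join them.
code_point = ((lead & 0b00011111) << 6) | (cont & 0b00111111)

# Cross-check against Ruby's own UTF-8 decoding of the same two bytes.
decoded = [lead, cont].pack('C*').force_encoding('UTF-8').codepoints.to_a

puts lead_bits, cont_bits, format('0x%02x', code_point), decoded.inspect
```

Both routes give 0xf1 (241), the code point of ñ.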

If you take these five bytes and treat them as being ISO-8859-1, then they correspond to the string

niÃ±o

Looking at the ISO-8859-1 code page, 0xc3 maps to Ã and 0xb1 maps to ±.
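The misreading in the other direction can be reproduced the same way. In this sketch (Ruby 1.9 or later assumed, names illustrative) the five UTF-8 bytes are deliberately labelled as ISO-8859-1 and then transcoded, so every byte becomes its own character.

```ruby
# The five bytes that make up the UTF-8 form of the message.
utf8_bytes = [0x6e, 0x69, 0xc3, 0xb1, 0x6f].pack('C*')

# Wrongly treat them as ISO-8859-1, then convert to UTF-8 for display.
misread = utf8_bytes.dup.force_encoding('ISO-8859-1').encode('UTF-8')

# Each original byte is now one character: U+00C3 and U+00B1 replace the
# single U+00F1 that the two middle bytes were meant to encode.
misread_points = misread.codepoints.to_a

puts misread, misread.length, misread_points.map { |c| format('U+%04X', c) }.inspect
```

The result is five characters long instead of four, which is the usual sign of UTF-8 data having been read as ISO-8859-1.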

So what's happening on your local machine is that your app is receiving the five bytes 0x6e 0x69 0xc3 0xb1 0x6f from Redis, which is the UTF-8 representation of "niño". On Heroku it's receiving the four bytes 0x6e 0x69 0xf1 0x6f, which is the ISO-8859-1 representation.

The real fix to your problem will be to make sure the strings being put into Redis are all already UTF-8 (or at least all the same encoding). I haven't used Redis, but from what I can tell from a brief Google, it doesn't concern itself with string encodings but simply gives back whatever bytes it's been given. You should look at whatever process is putting the data into Redis, and ensure that it handles the encoding properly.
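One way to enforce this at the point where data is written is a small normalising step. The sketch below is only an illustration: `normalize_to_utf8` is a hypothetical helper, not part of Rails or any Redis client, and it assumes that anything which is not valid UTF-8 is ISO-8859-1, which may not hold for your producers.

```ruby
# Hypothetical helper: returns a UTF-8 string for input that is either
# already UTF-8 or (by assumption) ISO-8859-1.
def normalize_to_utf8(raw, fallback = 'ISO-8859-1')
  candidate = raw.dup.force_encoding('UTF-8')
  # Bytes that already form valid UTF-8 are kept as they are.
  return candidate if candidate.valid_encoding?

  # Otherwise read the bytes in the fallback encoding and transcode them.
  raw.encode('UTF-8', fallback)
end

latin1 = [0x6e, 0x69, 0xf1, 0x6f].pack('C*')         # as seen on Heroku
utf8   = [0x6e, 0x69, 0xc3, 0xb1, 0x6f].pack('C*')   # as seen locally

from_latin1 = normalize_to_utf8(latin1)
from_utf8   = normalize_to_utf8(utf8)

puts from_latin1.bytes.to_a.inspect, from_utf8.bytes.to_a.inspect
```

The validity check is a heuristic: a short ISO-8859-1 string can occasionally happen to be valid UTF-8, so fixing the encoding in the process that writes to Redis is still the more reliable option.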

answered Nov 16 '22 02:11 by matt