
Character encoding with Ruby 1.9.3 and the mail gem

I'm trying to parse email strings with the Ruby mail gem, and I'm having a devil of a time with character encodings. Take the following email:

MIME-Version: 1.0
Sender: [email protected]
Received: by 10.142.239.17 with HTTP; Thu, 14 Jun 2012 06:00:18 -0700 (PDT)
Date: Thu, 14 Jun 2012 09:00:18 -0400
Delivered-To: [email protected]
X-Google-Sender-Auth: MxfFrMybNjBoBt4O4GwAn9cMsko
Message-ID: <CAGErOzF3FV5NvzN3zUpLGPok96SFzK18Z4HerzyYNALnzgMVaA@mail.gmail.com>
Subject: Re: [Lorem Ipsum] Foo updated the forum topic 'Reply by email test'
From: Foo Bar <[email protected]>
To: Foo <[email protected]>
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable

This reply has accents:=A0R=E9sum=E9
>
> --------- Reply Above This Line ------------
>
> Email parsing with accents: R=E9sum=E9
>
> Click here to view this post in your browser

The email body, when properly decoded, should be:

This reply has accents: Résumé
>
> --------- Reply Above This Line ------------
>
> Email parsing with accents: Résumé
>
> Click here to view this post in your browser

However, I can't get the accented characters to come through. Here's what I've tried:

message = Mail.new(email_string)
body = message.body.decoded

That gets me a string that starts like this:

This reply has accents:\xA0R\xE9sum\xE9\r\n>\r\n> --------- Reply Above This Line ------------
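For reference, those escaped bytes are simply what quoted-printable decoding produces: each `=XX` becomes the single byte 0xXX, with no character set attached yet. A minimal stdlib-only sketch, assuming Ruby's `String#unpack` with the `M` directive rather than whatever decoder the mail gem uses internally:

```ruby
# One quoted-printable line from the email body above.
qp_line = "This reply has accents:=A0R=E9sum=E9"

# unpack("M") decodes quoted-printable and returns the raw bytes.
raw = qp_line.unpack("M").first

# Show the trailing bytes in hex: 0xA0 is a non-breaking space and
# 0xE9 is "é" in ISO-8859-1.
p raw.bytes.last(7).map { |b| format("%02X", b) }
# prints ["A0", "52", "E9", "73", "75", "6D", "E9"]
```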

Then I tried this:

body.encoding # => #<Encoding:ASCII-8BIT>
body.encode("UTF-8") # => Encoding::UndefinedConversionError: "\xA0" from ASCII-8BIT to UTF-8

Does anyone have any suggestions on how to deal with this? I'm pretty sure it has to do with the "charset=ISO-8859-1" setting in the email, but I'm not sure how to use that, or if there's a way to easily extract that using the mail gem.

Micah asked Jun 14 '12 18:06


1 Answer

After experimenting a bit, I found that this works:

message.body.decoded.force_encoding("ISO-8859-1").encode("UTF-8") # => "This reply has accents: Résumé..."
message.parts.map { |part| part.decoded.force_encoding(part.charset).encode("UTF-8") } # multipart, using each part's own charset
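The reason the two-step chain works is that `force_encoding` only relabels the bytes, while `encode` actually transcodes them. The sketch below shows this with plain Ruby strings and no mail gem; the sample bytes are an assumption standing in for a decoded message body.

```ruby
# Bytes as they come out of a decoded body: binary, with no character meaning.
raw = "R\xE9sum\xE9".dup.force_encoding("ASCII-8BIT")

# encode on a binary string fails: bytes >= 0x80 have no defined mapping.
begin
  raw.encode("UTF-8")
rescue Encoding::UndefinedConversionError => e
  puts "encode alone fails: #{e.class}"
end

# Step 1: tell Ruby which charset the bytes are in (no bytes change).
latin1 = raw.dup.force_encoding("ISO-8859-1")

# Step 2: transcode from that charset to UTF-8 (each 0xE9 becomes two bytes).
utf8 = latin1.encode("UTF-8")

puts utf8          # prints Résumé
puts utf8.bytesize # prints 8
```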

You can read the charset from the message instead of hard-coding it:

message.charset #=> for simple, non-multipart
message.parts.map { |part| part.charset } #=> for multipart, each part can have its own charset

Be careful with non-multipart messages: ask the message for its charset, not the body, as the body reports the wrong one here:

message.body.charset #=> "US-ASCII", which is WRONG for this email
message.body.decoded.force_encoding(message.body.charset).encode("UTF-8") #=> Conversion error...

message.body.decoded.force_encoding(message.charset).encode("UTF-8") #=> Correct conversion :)
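
If the declared charset might be missing or unrecognized, the conversion can be wrapped in a small helper. This is only a sketch: `to_utf8` is a hypothetical name, not part of the mail gem, and falling back to "?" for unconvertible bytes is an assumption about what you want.

```ruby
# Hypothetical helper (not part of the mail gem): transcode raw body bytes
# to UTF-8 using the charset declared in the message headers.
def to_utf8(raw, charset)
  encoding = begin
    Encoding.find(charset.to_s)
  rescue ArgumentError
    Encoding::BINARY # unknown or missing charset: treat the bytes as opaque
  end

  labeled = raw.dup.force_encoding(encoding)
  # Replace anything that cannot be converted instead of raising.
  labeled.encode("UTF-8", invalid: :replace, undef: :replace, replace: "?")
end

puts to_utf8("R\xE9sum\xE9", "ISO-8859-1") # prints Résumé
puts to_utf8("R\xE9sum\xE9", nil)          # prints R?sum?
```

With the mail gem, a call would presumably look like `to_utf8(message.body.decoded, message.charset)`.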

Micah answered Sep 28 '22 06:09