Why do I get a string encoding issue "\xE2" from ASCII-8BIT to UTF-8?

I'm trying to download a PDF from an email and write the contents to a file. For some reason, I'm getting this error:

An Encoding::UndefinedConversionError occurred in attachments#inbound: "\xE2" from ASCII-8BIT to UTF-8 app/controllers/api/attachments_controller.rb:70:in `write'

Here's my code:

def inbound
    if Rails.env.production? or Rails.env.staging?
      email = Postmark::Mitt.new(request.body.read)
    else
      email = Postmark::Mitt.new(File.binread "#{Rails.root}/app/temp_pdfs/email.json")
    end

    if email.attachments.count == 0
      # notify aidin that we got an inbound email with no attachments
      respond_to do |format|
        format.json { head :no_content }
      end
      return
    end
    attachment = email.attachments.first
    filename = "attachment" + (Time.now.strftime("%Y%m%d%H%M%S")+(rand * 1000000).round.to_s) + ".pdf"
    base_path = "#{Rails.root}/temp_attachments/"
    unless File.directory?(base_path)
      Dir::mkdir(base_path)
    end
    file = File.new base_path + filename, 'w+'
    file.write Base64.decode64(attachment.source['Content'].encode("UTF-16BE", :invalid=>:replace, :replace=>"?").encode("UTF-8"))
    file.close
    write_options = write_options()
    write_options[:metadata] = {:filename => attachment.file_name, :content_type => attachment.content_type, :size => attachment.size }

    obj = s3_object()
    file = File.open file.path
    obj.write(file.read, write_options)
    file.close

    FaxAttach.trigger obj.key.split('/').last

    render :nothing => true, :status => 202 and return
  end

I read around and it looked like the way to solve this was:

file.write Base64.decode64(attachment.source['Content'].encode("UTF-16BE", :invalid=>:replace, :replace=>"?").encode("UTF-8"))

but it doesn't seem to work.

asked Jun 25 '13 14:06 by chintanparikh



1 Answer

The error is actually raised by the file write itself, not by the encode/decode calls in its argument: the file is opened in text mode, so Ruby tries to apply the default character encoding to the binary data in file.write. The quickest fix is to add the b flag when you open the file, so the bytes are written unchanged:

file = File.new base_path + filename, 'wb+'
file.write Base64.decode64( attachment.source['Content'] )

This assumes the incoming attachment is Base64-encoded, as your code implies (I have no way to verify this). Base64 text consists only of ASCII characters, so the string in attachment.source['Content'] is the same bytes in ASCII-8BIT and UTF-8, and there is no point converting it inside the call to decode64.
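
As a rough, standalone illustration of the points above (not your controller code), the sketch below decodes a Base64 string, shows that the result is a binary string that cannot be converted to UTF-8, and then writes it in binary mode. The sample bytes, the `encoded` variable standing in for attachment.source['Content'], and the temp-directory path are all made up for the example.

```ruby
require "base64"
require "tmpdir"

# Hypothetical sample data: a few bytes including \xE2, similar to the
# binary comment line often found near the start of a PDF.
raw_bytes = "%PDF-1.4\n\xE2\xE3\xCF\xD3\n".b

# Stand-in for attachment.source['Content'] (assumed to be Base64 text).
encoded = Base64.strict_encode64(raw_bytes)

# decode64 returns a binary (ASCII-8BIT) string.
decoded = Base64.decode64(encoded)

# Converting those bytes to UTF-8 fails the same way the original write did.
error_class = begin
  decoded.encode("UTF-8")
  nil
rescue Encoding::UndefinedConversionError => e
  e.class
end

# Writing in binary mode ("wb") stores the bytes as-is, with no conversion.
path = File.join(Dir.mktmpdir, "attachment.pdf")
File.open(path, "wb") { |f| f.write(decoded) }

# Read the file back in binary mode to confirm nothing was altered.
written = File.binread(path)
```

Using the block form of File.open also closes the file automatically, so the separate file.close call is not needed.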

answered Sep 21 '22 18:09 by Neil Slater