
How can I use Net::HTTP to download a file with UTF-8 characters in it?

I have an application where users can upload text-based files (xml, csv, txt) that are persisted to S3. Some of these files are pretty big. A variety of operations need to be performed on the data in these files, so rather than reading them from S3 and occasionally having a request time out, I download the files locally and then turn the operations loose on them.

Here's the code I use to download the file from S3. Upload is the name of the AR model I use to store this information. This method is an instance method on the Upload model:

def download
  basename = File.basename(self.text_file_name.path)
  filename = Rails.root.join(basename)
  host = MyFitment::Utility.get_host_without_www(self.text_file_name.url)
  Net::HTTP.start(host) do |http|
    f = open(filename)
    begin
      http.request_get(self.text_file_name.url) do |resp|
        resp.read_body do |segment|
          f.write(segment) # Fails when non-ASCII 8-bit characters are included.
        end
      end
    ensure
      f.close()
    end
  end
  filename
end

The commented line above is where the write fails. This code somehow thinks all files that are downloaded are encoded in ASCII 8-bit. How can I:

1) Check the encoding of a remote file like that
2) Download it and write it successfully.

Here's the error that is happening with a particular file right now:

Encoding::UndefinedConversionError: "\x95" from ASCII-8BIT to UTF-8
from /Users/me/code/myapp/app/models/upload.rb:47:in `write'

Thank you for any help you can offer!

AKWF asked Oct 21 '15 23:10



1 Answer

How can I: 1) Check the encoding of a remote file like that.

You can check the Content-Type header of the response, which, if present, may look something like this:

Content-Type: text/plain; charset=utf-8

As you can see, the encoding is specified there. If there's no Content-Type header, or if the charset is not specified, or if the charset is specified incorrectly, then you can't know the encoding of the text. There are gems that try to guess the encoding (with varying degrees of accuracy), e.g. rchardet and charlock_holmes, but to be certain you have to know the encoding before reading the text.
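The header check described above can be sketched in a few lines. This is only an illustration, not part of the original code: charset_from_content_type and encoding_for are hypothetical helper names, and the header value is assumed to come from resp['Content-Type'] on the Net::HTTP response.

```ruby
# Hypothetical helper: pull the charset parameter out of a Content-Type value.
# Returns nil when the header is missing or carries no charset.
def charset_from_content_type(header)
  return nil if header.nil?
  match = header.match(/charset\s*=\s*"?([^\s;"]+)"?/i)
  match && match[1]
end

# Hypothetical helper: map the charset label to a Ruby Encoding object.
# Returns nil when there is no label or Ruby does not recognise the name.
def encoding_for(header)
  label = charset_from_content_type(header)
  return nil if label.nil?
  Encoding.find(label)
rescue ArgumentError
  nil
end

puts charset_from_content_type("text/plain; charset=utf-8")
puts encoding_for("text/plain").inspect
```

A nil result is the case where you are back to guessing or to out-of-band knowledge about the file.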

This code somehow thinks all files that are downloaded are encoded in ASCII 8-bit.

In Ruby, ASCII-8BIT is the binary encoding (Encoding::BINARY is an alias for it), which means the Net::HTTP library just gives you a string containing a series of single bytes, and it's up to you to decide how to interpret those bytes.

If you want to interpret those bytes as UTF-8, you do that with String#force_encoding(), which relabels the string without changing any of its bytes:

text = text.force_encoding("UTF-8")

You might want to do that if, for instance, you want to do some regex matching on the string, and you want to match full characters (which might be multi-byte) rather than just single bytes.
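A small sketch of the difference between the two views of the same bytes. The helper name as_utf8 is hypothetical. Because force_encoding never validates anything, the sketch also checks String#valid_encoding? afterwards.

```ruby
# Hypothetical helper: relabel a copy of the bytes as UTF-8.
# The bytes themselves are not modified, only the encoding tag.
def as_utf8(bytes)
  bytes.dup.force_encoding("UTF-8")
end

# Bytes tagged ASCII-8BIT, the way Net::HTTP hands them over.
bytes = "caf\xC3\xA9".b
puts bytes.encoding        # ASCII-8BIT
puts bytes.length          # 5, counted in bytes

text = as_utf8(bytes)
puts text.length           # 4, counted in characters
puts text.valid_encoding?  # true

# A lone \x95 is not valid UTF-8, and relabeling does not change that.
suspect = as_utf8("\x95 item".b)
puts suspect.valid_encoding?  # false
```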

Encoding::UndefinedConversionError: "\x95" from ASCII-8BIT to UTF-8

Using String#encode('UTF-8') to convert ASCII-8BIT to UTF-8 doesn't work for bytes whose values are greater than 127:

(0..255).each do |ascii_code|
  str = ascii_code.chr("ASCII-8BIT")
  #puts str.encoding   #=>ASCII-8BIT

  begin
    str.encode("UTF-8")
  rescue Encoding::UndefinedConversionError
    puts "Can't encode char with ascii code #{ascii_code} to UTF-8."
  end

end

--output:--
Can't encode char with ascii code 128 to UTF-8.
Can't encode char with ascii code 129 to UTF-8.
Can't encode char with ascii code 130 to UTF-8.
...
...
Can't encode char with ascii code 253 to UTF-8.
Can't encode char with ascii code 254 to UTF-8.
Can't encode char with ascii code 255 to UTF-8.

Ruby converts an ASCII-8BIT string one byte at a time, and ASCII-8BIT defines no characters for bytes 128 through 255, so there is nothing to map those bytes to in UTF-8. A byte such as 128 can legally appear in UTF-8 as part of a multi-byte sequence, but it is never a valid UTF-8 character on its own.
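When the bytes are not UTF-8 at all, they need to be transcoded from their real encoding rather than relabeled. The sketch below assumes, purely for illustration, that the file is Windows-1252. The question does not confirm this, although \x95 is the bullet character in that encoding. Both helper names are hypothetical.

```ruby
# Hypothetical helper: treat the bytes as Windows-1252 (an assumption about
# the source file) and transcode them to UTF-8.
def transcode_cp1252(raw)
  raw.encode("UTF-8", "Windows-1252")
end

# Hypothetical fallback for when the real encoding is unknown: keep the ASCII
# bytes and replace every byte that has no defined conversion with "?".
def lossy_utf8(raw)
  raw.encode("UTF-8", "BINARY", invalid: :replace, undef: :replace, replace: "?")
end

raw = "\x95 first item".b

text = transcode_cp1252(raw)
puts text.encoding                 # UTF-8
puts text == "\u2022 first item"   # true

puts lossy_utf8(raw)               # ? first item
```

The lossy version discards information, so it is only reasonable when the non-ASCII characters do not matter to the later processing.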

As for writing the strings to a file, note that open(filename) with no mode argument opens the file read-only. Instead of this:

f = open(filename)

...if you want to open the file for writing with UTF-8 as its external encoding, you would write:

f = open(filename, "w:UTF-8")

By default, Ruby uses the value of Encoding.default_external to encode output to a file. The default_external encoding is taken from your system's environment (the locale settings), or you can set it explicitly.
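If the goal is simply to store the file exactly as S3 served it, another option is to defer the encoding decision until the file is read back. A minimal sketch, assuming byte-for-byte storage is acceptable: opening the file in "wb" mode gives it a binary external encoding, so write performs no transcoding. write_chunks is a hypothetical helper, and the array stands in for the segments yielded by resp.read_body.

```ruby
require "tempfile"

# Hypothetical helper: write binary segments to a file without transcoding.
def write_chunks(path, chunks)
  File.open(path, "wb") do |f|
    chunks.each { |segment| f.write(segment) }
  end
  path
end

# Stand-ins for the segments Net::HTTP would yield (tagged ASCII-8BIT).
segments = ["\x95 first ".b, "item\n".b]

tmp = Tempfile.new("download")
write_chunks(tmp.path, segments)

# The file holds exactly the bytes that were written.
stored = File.binread(tmp.path)
puts stored == segments.join
tmp.close!
```

In the download method from the question, this would amount to opening the file with File.open(filename, "wb") and leaving the read_body loop unchanged.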

7stud answered Oct 01 '22 01:10
