
Using Net::FTP gettextfile with invalid characters (ASCII-8BIT vs UTF-8)

I have a process that fetches a flat file from a mainframe via FTP. This usually works fine, but every now and then the file will contain something like an accented character. If I try to get a file containing an accent, the entire process fails with the following error: Encoding::UndefinedConversionError: "\x88" from ASCII-8BIT to UTF-8

That's using Net::FTP's gettextfile method. Many people suggest simply switching to getbinaryfile. Doing so lets me download the file, but the resulting file is something that I can no longer parse (it claims to be UTF-8, but the contents make no sense).

Is there any way to simply fetch and save the file as ASCII without having Rails automatically convert the output to UTF-8? Here's my code:

Net::FTP.open(config['host']) do |ftp|
  Rails.logger.info("FTP Connection established")

  ftp.login(config['user'], config['password'])
  Rails.logger.info("Login Successful")

  ftp.gettextfile("'#{config['es_in']}'", "data/es-in.#{Time.now.utc.strftime("%Y%m%d-%H%M%S")}")
  ftp.gettextfile("'#{config['ca_in']}'", "data/ca-in.#{Time.now.utc.strftime("%Y%m%d-%H%M%S")}")

  Rails.logger.info("Download(s) completed, terminating connection.")
end
Alec Sanger asked May 14 '14 18:05



1 Answer

If I remember right, text files in FTP-dom are 7-bit ASCII and cannot contain bytes with the high bit set, which is what Ruby is reporting as ASCII-8BIT. Accented characters, even in extended ASCII or 8BIT or whatever we want to call anything above 0x7F, need to be transferred in binary mode.

From the FTP RFC:

   ASCII

     The ASCII character set is as defined in the ARPA-Internet
     Protocol Handbook.  In FTP, ASCII characters are defined to be
     the lower half of an eight-bit code set (i.e., the most
     significant bit is zero).

So yes, you should probably use getbinaryfile instead.
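As a rough sketch of that switch, assuming the same config keys and file naming as the question: `fetch_binary` is a hypothetical helper name, and `ftp` is anything that responds to `getbinaryfile`, such as a logged-in Net::FTP session. It also computes the timestamp once instead of inline.

```ruby
# Hypothetical helper: downloads one remote file in binary (image) mode so
# the bytes are written to disk untouched, with no encoding conversion.
# `ftp` is any object that responds to #getbinaryfile, e.g. a Net::FTP session.
def fetch_binary(ftp, remote_name, local_prefix, now: Time.now.utc)
  local_path = "#{local_prefix}.#{now.strftime('%Y%m%d-%H%M%S')}"
  ftp.getbinaryfile(remote_name, local_path)
  local_path
end

# Usage with the question's setup (not run here):
#
#   require 'net/ftp'
#   Net::FTP.open(config['host']) do |ftp|
#     ftp.login(config['user'], config['password'])
#     fetch_binary(ftp, "'#{config['es_in']}'", 'data/es-in')
#     fetch_binary(ftp, "'#{config['ca_in']}'", 'data/ca-in')
#   end
```

The helper returns the local path so the caller can hand it straight to whatever decodes or parses the file next.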

The main practical difference between the two is that binary mode won't do line-end translations. If the source system is EBCDIC-based or uses an alternate word-size, gettextfile will translate the file on the fly to ASCII. Encountering characters that are not in the expected encoding could easily trigger the sort of problem you're seeing.

If the file makes no sense after transferring using getbinaryfile, it could be in a codeset other than UTF-8 on the mainframe. You'll have to figure out what codeset it is in on that system and open the file with the appropriate encoding settings after downloading. You can use the file command on *nix systems to make an educated guess about a file's encoding, but it's not an exhaustive test and can be misled. Because the file is coming from a mainframe, it could be using a multi-byte encoding like UTF-16BE or UTF-32LE, or be encoded in EBCDIC. This is where dealing with alternate OSes and hardware gets really annoying.
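One way to test guesses about the codeset is to read the downloaded bytes raw and try decoding them under a few candidate encodings, then eyeball which result looks sane. This is only a sketch: `decode_as`, `preview_decodings`, and the candidate list are made up for illustration and say nothing about what the mainframe really uses. Whether your Ruby ships a converter for a given code page (an EBCDIC one, say) may depend on the Ruby version, so the helper returns nil instead of raising when conversion isn't possible.

```ruby
# Interpret raw bytes as `source_encoding` and return them as UTF-8.
# Returns nil when the bytes are invalid in that encoding or when Ruby
# cannot convert them.
def decode_as(raw_bytes, source_encoding)
  text = raw_bytes.dup.force_encoding(source_encoding)
  return nil unless text.valid_encoding?

  text.encode('UTF-8')
rescue EncodingError
  nil
end

# Illustrative guesses only, not a claim about the real source encoding.
CANDIDATES = %w[UTF-8 ISO-8859-1 Windows-1252 CP850].freeze

# Build a hash of encoding name => decoded text (or nil) for inspection.
def preview_decodings(raw_bytes, candidates = CANDIDATES)
  candidates.each_with_object({}) do |name, results|
    results[name] = decode_as(raw_bytes, name)
  end
end

# Usage (hypothetical path):
#
#   raw = File.binread('data/es-in.20140514-180500')
#   preview_decodings(raw).each do |enc, text|
#     puts "#{enc}: #{text.inspect[0, 80]}"
#   end
```

Once one candidate produces readable text, the same name can be used when opening the file for parsing, e.g. `File.read(path, encoding: 'ISO-8859-1:UTF-8')`.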

Without examples of the text, the first two bytes of the file, and a sampling of the text in a hex-dump, it's hard to help you.
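If you want to gather that information from Ruby rather than a hex editor, something along these lines would do it. The helper names are hypothetical, and the byte-order-mark check only covers the standard Unicode BOMs; a file with no BOM tells you nothing either way.

```ruby
# Format the first `count` bytes of a binary string as space-separated hex.
def hex_preview(raw_bytes, count = 32)
  raw_bytes.byteslice(0, count).bytes.map { |b| format('%02x', b) }.join(' ')
end

# Report which Unicode byte-order mark, if any, the data starts with.
# The UTF-32LE check must come before UTF-16LE because the UTF-32LE BOM
# begins with the same two bytes.
def detect_bom(raw_bytes)
  head = raw_bytes.byteslice(0, 4).to_s.b
  return 'UTF-32LE' if head.start_with?("\xFF\xFE\x00\x00".b)
  return 'UTF-32BE' if head.start_with?("\x00\x00\xFE\xFF".b)
  return 'UTF-8'    if head.start_with?("\xEF\xBB\xBF".b)
  return 'UTF-16LE' if head.start_with?("\xFF\xFE".b)
  return 'UTF-16BE' if head.start_with?("\xFE\xFF".b)

  nil
end

# Usage (hypothetical path):
#
#   raw = File.binread('data/es-in.20140514-180500')
#   puts hex_preview(raw)
#   puts detect_bom(raw).inspect
```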

And, after all that, it might be easier to use cURL, or the Curb gem to retrieve the file. cURL is very flexible and powerful and might give you the tools you need.

the Tin Man answered Oct 19 '22 10:10
