I have a process that fetches a flat file from a mainframe via FTP. This usually works fine, but every now and then the file will contain an accented character. If I try to get a file containing an accent, the entire process fails with the following error: Encoding::UndefinedConversionError: "\x88" from ASCII-8BIT to UTF-8
That's using Net::FTP's gettextfile method. Many people suggest simply switching to getbinaryfile. Doing so allows me to download the file, but the resulting file is something I can no longer parse (it says it's in UTF-8, but the contents make no sense).
Is there any way to simply fetch and save the file as ASCII without having rails automatically convert the output to UTF-8? Here's my code:
Net::FTP.open(config['host']) do |ftp|
  Rails.logger.info("FTP Connection established")
  ftp.login(config['user'], config['password'])
  Rails.logger.info("Login Successful")
  ftp.gettextfile("'#{config['es_in']}'", "data/es-in.#{Time.now.utc.strftime("%Y%m%d-%H%M%S")}")
  ftp.gettextfile("'#{config['ca_in']}'", "data/ca-in.#{Time.now.utc.strftime("%Y%m%d-%H%M%S")}")
  Rails.logger.info("Download(s) completed, terminating connection.")
end
If I remember right, text files in FTP are 7-bit ASCII and cannot contain characters with the high bit set. Ruby labels raw bytes with no known encoding ASCII-8BIT, which is the name you see in the error. Accented characters, whether in extended ASCII, 8-bit, or whatever we want to call anything above 0x7F, need to be transferred in binary mode.
From the FTP RFC:
ASCII
The ASCII character set is as defined in the ARPA-Internet
Protocol Handbook. In FTP, ASCII characters are defined to be
the lower half of an eight-bit code set (i.e., the most
significant bit is zero).
So yes, you should probably use getbinaryfile instead.
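As a rough sketch of that switch, the helper below wraps getbinaryfile and keeps the timestamped file name from the question. The helper name `fetch_raw` and its parameters are made up for illustration; `ftp` is assumed to be an already logged-in Net::FTP session (or anything else that responds to getbinaryfile).

```ruby
# Hypothetical helper: download one remote file byte-for-byte.
# `ftp` is assumed to be a logged-in Net::FTP session; only
# getbinaryfile(remote, local) is called on it.
def fetch_raw(ftp, remote_name, local_prefix, now = Time.now.utc)
  local_path = "#{local_prefix}.#{now.strftime('%Y%m%d-%H%M%S')}"
  # Binary mode: no line-end or character translation is requested,
  # so the bytes land on disk exactly as the server sent them.
  ftp.getbinaryfile(remote_name, local_path)
  local_path
end
```

Inside the Net::FTP.open block from the question, this could be called as `fetch_raw(ftp, "'#{config['es_in']}'", "data/es-in")`. Returning the local path makes it easy to hand the file to whatever decodes it next.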
The main practical difference between the two is that binary mode won't do line-end translations. If the source system is EBCDIC-based or uses an alternate word size, a text-mode transfer such as gettextfile will have the file translated to ASCII on the fly. Encountering characters that are not in the expected encoding could easily trigger the sort of problem you're seeing.
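The failure in the question can be reproduced without FTP at all. This is a minimal sketch, assuming the downloaded bytes reach Ruby tagged as binary (ASCII-8BIT); the method name `convertible_to_utf8?` is made up for illustration.

```ruby
# Returns true when binary-tagged bytes can be transcoded to UTF-8.
def convertible_to_utf8?(bytes)
  bytes.b.encode("UTF-8")
  true
rescue Encoding::UndefinedConversionError
  # Raised for bytes >= 0x80: binary data has no defined mapping to UTF-8.
  false
end

puts convertible_to_utf8?("plain ascii")  # 7-bit characters only
puts convertible_to_utf8?("caf\x88")      # contains the 0x88 byte from the error
```

Ruby cannot know which character 0x88 is meant to be until it is told the source encoding, which is why the conversion is refused rather than guessed.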
If the file makes no sense after transferring it with getbinaryfile, it could be in a codeset other than UTF-8 on the mainframe. You'll have to figure out which codeset that system uses and open the file with the appropriate encoding settings after downloading. You can use the file command on *nix systems to make an educated guess about a file's encoding, but it's not an exhaustive test and can be misled. Because the file is coming from a mainframe, it could be using a multi-byte encoding like UTF-16BE or UTF-32LE, or be encoded in EBCDIC. This is where dealing with alternate OSes and hardware gets really annoying.
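One way to narrow the codeset down is to decode the same raw bytes under a few candidate encodings and see which result reads correctly. The sketch below does that; the candidate list is a guess for illustration only, not something known about your mainframe, and `decode_candidates` is a made-up name. Encodings that Ruby does not know or cannot convert simply come back as nil.

```ruby
# Try to read raw bytes under each candidate encoding and convert to UTF-8.
# Returns a hash of encoding name => decoded text (nil when it fails).
def decode_candidates(bytes, candidates)
  candidates.to_h do |name|
    text = begin
      bytes.dup.force_encoding(name).encode("UTF-8")
    rescue EncodingError, ArgumentError
      nil # unknown encoding name, no converter, or invalid bytes
    end
    [name, text]
  end
end

raw = "caf\xE9".b # stand-in for File.binread on the downloaded file
decode_candidates(raw, %w[ISO-8859-1 Windows-1252 IBM437 IBM037]).each do |name, text|
  puts "#{name}: #{text.inspect}"
end
```

Once one candidate produces sensible accents, the same force_encoding and encode pair can be used to rewrite the whole file as UTF-8 before parsing.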
Without examples of the text, the first two bytes of the file, and a sampling of the text in a hex-dump, it's hard to help you.
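If it helps to gather that information, here is a small sketch that prints the leading bytes of a downloaded file in hex and checks for a byte-order mark. Both method names are invented for this example, and a missing BOM proves nothing about the encoding; it only rules out the easy cases.

```ruby
# Hex-dump the first `count` bytes of a file as space-separated pairs.
def hex_preview(path, count = 16)
  File.binread(path, count).to_s.bytes.map { |b| format("%02x", b) }.join(" ")
end

# Name the byte-order mark at the start of the file, or nil if none is found.
def bom_name(path)
  head = File.binread(path, 3).to_s.bytes
  return "UTF-8" if head == [0xEF, 0xBB, 0xBF]

  case head.first(2)
  when [0xFE, 0xFF] then "UTF-16BE"
  when [0xFF, 0xFE] then "UTF-16LE or UTF-32LE"
  end
end
```

Running `hex_preview` on a file fetched with getbinaryfile shows the untouched bytes, which is what anyone trying to identify the codeset will want to see.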
And, after all that, it might be easier to use cURL, or the Curb gem to retrieve the file. cURL is very flexible and powerful and might give you the tools you need.