I have two CSV files stored on S3. When I open one of them, a File is returned. When I open the other one, a StringIO is returned.
fn1 #=> "http://SOMEWHERE.s3.amazonaws.com/setup_data/d1/file1.csv"
open(fn1) #=> #<File:/var/folders/sm/k7kyd0ns4k9bhfy7yqpjl2mh0000gn/T/open-uri20140814-26070-11cyjn1>
fn2 #=> "http://SOMEWHERE.s3.amazonaws.com/setup_data/d2/d3/file2.csv"
open(fn2) #=> #<StringIO:0x007f9718670ff0>
Why? Is there any way to open them with a consistent data type?
I need to pass the same data type into CSV.read(open(file_url)), which doesn't work if it sometimes gets a File and sometimes a StringIO.
They were created by different Ruby scripts (they contain very different data).
On my Mac, they both appear to be ordinary text CSV files. They were uploaded via the AWS console and have identical permissions and identical metadata (content-type: application/octet-stream).
This is by design. A tempfile is created if the size of the object is greater than 10240 bytes. From the source:
StringMax = 10240
def <<(str)
  @io << str
  @size += str.length
  if StringIO === @io && StringMax < @size
    require 'tempfile'
    io = Tempfile.new('open-uri')
    io.binmode
    Meta.init io, @io if Meta === @io
    io << @io.string
    @io = io
  end
end
If you need a StringIO object, you could use fastercsv.
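If a single return type matters more than memory use, one option is to copy whatever open-uri hands back into a fresh StringIO. The following is only a minimal sketch: to_string_io is a hypothetical helper name, not part of open-uri, and it assumes the whole body fits comfortably in memory.

```ruby
require 'stringio'
require 'tempfile'

# Hypothetical helper (not part of open-uri): copies the body of either a
# StringIO or a Tempfile into a new StringIO, so callers always get one type.
def to_string_io(io)
  io.rewind              # read from the start, whatever the current position
  StringIO.new(io.read)  # both StringIO and Tempfile respond to #read
end
```

It would be called as to_string_io(open(file_url)). On Ruby 3.0 and later, open-uri is invoked through URI.open rather than Kernel#open.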
CSV::read expects a file path as its argument, not an already opened IO object. It then opens the file and reads the contents. Your code works in the Tempfile case because Ruby calls to_path behind the scenes on anything passed to File::open, and File objects respond to this method. The result is that CSV opens a second IO on the same file. A StringIO has no to_path, so the same call fails.
Rather than use CSV::read, you could create a new CSV object and call read on that (the instance method, not the class method). CSV::new handles IO objects correctly:
CSV.new(open(file_url)).read
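To check this without touching the network, the sketch below feeds the same CSV text to CSV.new once as a StringIO and once as a Tempfile, the two types open-uri can return. The parse_csv_io helper and the sample data are made up for illustration.

```ruby
require 'csv'
require 'stringio'
require 'tempfile'

# Hypothetical helper: parses CSV rows from any already-open IO-like object.
def parse_csv_io(io)
  io.rewind          # start from the beginning of the stream
  CSV.new(io).read   # instance-level read; returns an array of rows
end

data = "name,qty\nwidget,3\n"

# Small responses arrive as a StringIO.
from_memory = parse_csv_io(StringIO.new(data))

# Larger responses arrive as a Tempfile; simulate one here.
file = Tempfile.new('csv-demo')
file.write(data)
from_disk = parse_csv_io(file)
file.close!          # close and delete the temporary file
```

Both calls should yield the same array of rows, so the caller no longer needs to care which type open-uri produced.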