I'm attempting to stream a .gz file from S3 using boto and iterate over the lines of the unzipped text file. Mysteriously, the loop never terminates; when the entire file has been read, the iteration restarts at the beginning of the file.
Let's say I create and upload an input file like the following:
> echo '{"key": "value"}' > foo.json
> gzip -9 foo.json
> aws s3 cp foo.json.gz s3://my-bucket/my-location/
and I run the following Python script:
import boto
import gzip

connection = boto.connect_s3()
bucket = connection.get_bucket('my-bucket')
key = bucket.get_key('my-location/foo.json.gz')
gz_file = gzip.GzipFile(fileobj=key, mode='rb')
for line in gz_file:
    print(line)
The result is:
b'{"key": "value"}\n'
b'{"key": "value"}\n'
b'{"key": "value"}\n'
...forever...
Why is this happening? I think there must be something very basic that I am missing.
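Ah, boto. The problem is that the read method re-downloads the key if you call it again after the key has been read to the end (compare the read and next methods to see the difference). GzipFile does call it again: once it finishes a compressed stream it keeps reading to check whether another gzip member follows, so it is handed the file from the start.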
This isn't the cleanest way to do it, but it solves the problem:
import boto
import gzip


class ReadOnce(object):
    """Wrap a boto key so that read() keeps returning b'' after the first EOF."""

    def __init__(self, k):
        self.key = k
        self.has_read_once = False

    def read(self, size=0):
        # Once the key has reported end-of-data, never touch it again;
        # another key.read() would start the download over.
        if self.has_read_once:
            return b''
        data = self.key.read(size)
        if not data:
            self.has_read_once = True
        return data


connection = boto.connect_s3()
bucket = connection.get_bucket('my-bucket')
key = ReadOnce(bucket.get_key('my-location/foo.json.gz'))
gz_file = gzip.GzipFile(fileobj=key, mode='rb')
for line in gz_file:
    print(line)
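If you want to see the effect without touching S3, here is a rough sketch that fakes the behaviour locally. RestartingKey is a hypothetical stand-in I made up for illustration, not boto code: it only assumes that a read() after end-of-data starts again from the first byte, which is the behaviour described above. ReadOnce is the same wrapper as in the answer, repeated so the snippet runs on its own.

```python
import gzip
import io
from itertools import islice


class RestartingKey(object):
    """Hypothetical stand-in for a boto key (not boto code): after signalling
    end-of-data once, the next read() starts over from the first byte."""

    def __init__(self, payload):
        self._payload = payload
        self._stream = io.BytesIO(payload)

    def read(self, size=0):
        # Assumption: size=0 means "read everything", as in the wrapper below.
        data = self._stream.read(size if size else -1)
        if not data:
            # Mimic the re-download: rewind so the next call begins again.
            self._stream = io.BytesIO(self._payload)
        return data


class ReadOnce(object):
    """Same wrapper as in the answer: stay at EOF once it has been reached."""

    def __init__(self, k):
        self.key = k
        self.has_read_once = False

    def read(self, size=0):
        if self.has_read_once:
            return b''
        data = self.key.read(size)
        if not data:
            self.has_read_once = True
        return data


line = b'{"key": "value"}\n'
payload = gzip.compress(line)

# Unwrapped: the iteration would never end, so only take the first 3 lines.
looping = list(islice(gzip.GzipFile(fileobj=RestartingKey(payload), mode='rb'), 3))

# Wrapped: the iteration ends by itself after the single line.
fixed = list(gzip.GzipFile(fileobj=ReadOnce(RestartingKey(payload)), mode='rb'))

print(looping)
print(fixed)
```

Under these assumptions the bare stand-in keeps handing back the same line for as long as you ask, while the wrapped one yields it once and stops. The islice cap is only there so the unwrapped case terminates.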