
Infinite loop when streaming a .gz file from S3 using boto

I'm attempting to stream a .gz file from S3 using boto and iterate over the lines of the unzipped text file. Mysteriously, the loop never terminates; when the entire file has been read, the iteration restarts at the beginning of the file.

Let's say I create and upload an input file like the following:

> echo '{"key": "value"}' > foo.json
> gzip -9 foo.json
> aws s3 cp foo.json.gz s3://my-bucket/my-location/

and I run the following Python script:

import boto
import gzip

connection = boto.connect_s3()
bucket = connection.get_bucket('my-bucket')
key = bucket.get_key('my-location/foo.json.gz')
gz_file = gzip.GzipFile(fileobj=key, mode='rb')
for line in gz_file:
    print(line)

The result is:

b'{"key": "value"}\n'
b'{"key": "value"}\n'
b'{"key": "value"}\n'
...forever...

Why is this happening? I think there must be something very basic that I am missing.

asked Jun 05 '15 by zweiterlinde


1 Answer

Ah, boto. The problem is that boto's read method re-downloads the key if you call it again after the key has been read to completion (compare boto's read and next implementations to see the difference). So when gzip calls read one more time after hitting end-of-file, boto restarts the download and the stream begins again from the top.

This isn't the cleanest way to do it, but it solves the problem:

import boto
import gzip

class ReadOnce(object):
    """Wrap a boto key so that read() keeps returning b'' after the
    first end-of-file, instead of letting boto restart the download."""
    def __init__(self, k):
        self.key = k
        self.has_read_once = False

    def read(self, size=0):
        if self.has_read_once:
            return b''
        data = self.key.read(size)
        if not data:
            self.has_read_once = True
        return data

connection = boto.connect_s3()
bucket = connection.get_bucket('my-bucket')
key = ReadOnce(bucket.get_key('my-location/foo.json.gz'))
gz_file = gzip.GzipFile(fileobj=key, mode='rb')
for line in gz_file:
    print(line)
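
If the file fits in memory, an alternative that sidesteps the re-download behavior entirely is to fetch all the compressed bytes first (e.g. with boto's `get_contents_as_string()`) and decompress them locally. A minimal sketch, using locally compressed bytes as a stand-in for the S3 download:

```python
import gzip
import io

def lines_from_gzip_bytes(raw):
    """Decompress gzipped bytes in memory and yield one line at a time."""
    with gzip.GzipFile(fileobj=io.BytesIO(raw), mode='rb') as gz:
        for line in gz:
            yield line

# Stand-in for key.get_contents_as_string() on the real S3 key.
raw = gzip.compress(b'{"key": "value"}\n')
for line in lines_from_gzip_bytes(raw):
    print(line)  # b'{"key": "value"}\n' -- printed once; the loop terminates
```

Since `io.BytesIO` never re-fetches anything, gzip's extra read at end-of-file just gets an empty string and iteration stops normally.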

answered Sep 22 '22 by zweiterlinde