Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to gzip while uploading into s3 using boto

I have a large local file. I want to upload a gzipped version of that file into S3 using the boto library. The file is too large to gzip it efficiently on disk prior to uploading, so it should be gzipped in a streamed way during the upload.

The boto library knows a function set_contents_from_file() which expects a file-like object it will read from.

The gzip library knows the class GzipFile which can get an object via the parameter named fileobj; it will write to this object when compressing.

I'd like to combine these two functions, but the one API wants to read by itself, the other API wants to write by itself; neither knows a passive operation (like being written to or being read from).

Does anybody have an idea on how to combine these in a working fashion?

EDIT: I accepted one answer (see below) because it hinted me on where to go, but if you have the same problem, you might find my own answer (also below) more helpful, because I implemented a solution using multipart uploads in it.

like image 547
Alfe Avatar asked Apr 02 '13 01:04

Alfe


1 Answers

I implemented the solution hinted at in the comments of the accepted answer by garnaat:

import cStringIO
import gzip

def sendFileGz(bucket, key, fileName, suffix='.gz'):
    key += suffix
    mpu = bucket.initiate_multipart_upload(key)
    stream = cStringIO.StringIO()
    compressor = gzip.GzipFile(fileobj=stream, mode='w')

    def uploadPart(partCount=[0]):
        partCount[0] += 1
        stream.seek(0)
        mpu.upload_part_from_file(stream, partCount[0])
        stream.seek(0)
        stream.truncate()

    with file(fileName) as inputFile:
        while True:  # until EOF
            chunk = inputFile.read(8192)
            if not chunk:  # EOF?
                compressor.close()
                uploadPart()
                mpu.complete_upload()
                break
            compressor.write(chunk)
            if stream.tell() > 10<<20:  # min size for multipart upload is 5242880
                uploadPart()

It seems to work without problems. And after all, streaming is in most cases just a chunking of the data. In this case, the chunks are about 10MB large, but who cares? As long as we aren't talking about several GB chunks, I'm fine with this.


Update for Python 3:

from io import BytesIO
import gzip

def sendFileGz(bucket, key, fileName, suffix='.gz'):
    key += suffix
    mpu = bucket.initiate_multipart_upload(key)
    stream = BytesIO()
    compressor = gzip.GzipFile(fileobj=stream, mode='w')

    def uploadPart(partCount=[0]):
        partCount[0] += 1
        stream.seek(0)
        mpu.upload_part_from_file(stream, partCount[0])
        stream.seek(0)
        stream.truncate()

    with open(fileName, "rb") as inputFile:
        while True:  # until EOF
            chunk = inputFile.read(8192)
            if not chunk:  # EOF?
                compressor.close()
                uploadPart()
                mpu.complete_upload()
                break
            compressor.write(chunk)
            if stream.tell() > 10<<20:  # min size for multipart upload is 5242880
                uploadPart()
like image 103
Alfe Avatar answered Oct 15 '22 09:10

Alfe