 

Streaming decompression of an S3 gzip source object to an S3 destination object using Python?


Given a large gzip object in S3, what is a memory-efficient (e.g. streaming) method in Python 3 / boto3 to decompress the data and store the result back into another S3 object?

A similar question has been asked previously. However, all of the answers use a methodology in which the contents of the gzip file are first read into memory (e.g. into a BytesIO buffer). Those solutions are not viable for objects that are too big to fit in main memory.
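For reference, the approach in those answers looks roughly like the sketch below (hypothetical bucket and key names); both the compressed payload and the decompressed result end up fully buffered in RAM:

import gzip
import io

import boto3

s3 = boto3.client("s3")

# Hypothetical bucket/key names. The entire compressed object is read into a
# BytesIO buffer, and the entire decompressed result is also held in memory.
obj = s3.get_object(Bucket="my-bucket", Key="big.csv.gz")
buffered = io.BytesIO(obj["Body"].read())
data = gzip.GzipFile(fileobj=buffered, mode="rb").read()
s3.put_object(Bucket="my-bucket", Key="big.csv", Body=data)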

For large S3 objects, the contents need to be read, decompressed "on the fly", and then written to a different S3 object in some chunked fashion.

Thank you in advance for your consideration and response.

asked Oct 20 '20 by Ramón J Romero y Vigil

1 Answer

You can stream to and from S3 with boto3, but AFAIK you have to define your own file-like objects to do it by hand (a rough sketch of that approach follows below).
Luckily there's smart_open, which handles that for you; it also supports GCS, Azure, HDFS, SFTP and others.
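For contrast, doing it by hand with the plain boto3 multipart-upload API might look roughly like this sketch (bucket and key names are placeholders, error handling kept minimal, and a non-empty source object is assumed):

import gzip

import boto3

s3 = boto3.client("s3")
SRC_BUCKET, SRC_KEY = "my-bucket", "big.csv.gz"   # placeholder names
DST_BUCKET, DST_KEY = "my-bucket", "big.csv"
PART_SIZE = 8 * 1024 * 1024  # each part must be >= 5 MB (except the last)

# Body is a botocore StreamingBody: a non-seekable, read-only file-like object.
body = s3.get_object(Bucket=SRC_BUCKET, Key=SRC_KEY)["Body"]

mpu = s3.create_multipart_upload(Bucket=DST_BUCKET, Key=DST_KEY)
parts = []
part_number = 1
try:
    # GzipFile decompresses lazily, so only one part's worth of uncompressed
    # data is held in memory at a time.
    with gzip.GzipFile(fileobj=body, mode="rb") as plain:
        while True:
            chunk = plain.read(PART_SIZE)
            if not chunk:
                break
            resp = s3.upload_part(
                Bucket=DST_BUCKET, Key=DST_KEY, PartNumber=part_number,
                UploadId=mpu["UploadId"], Body=chunk,
            )
            parts.append({"PartNumber": part_number, "ETag": resp["ETag"]})
            part_number += 1
    s3.complete_multipart_upload(
        Bucket=DST_BUCKET, Key=DST_KEY, UploadId=mpu["UploadId"],
        MultipartUpload={"Parts": parts},
    )
except Exception:
    s3.abort_multipart_upload(Bucket=DST_BUCKET, Key=DST_KEY, UploadId=mpu["UploadId"])
    raise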
Here's an example with smart_open, using a large sample of sales data:

import boto3
from smart_open import open

session = boto3.Session()  # set auth credentials here if they aren't configured in your environment
chunk_size = 1024 * 1024  # read the decompressed stream roughly 1 MB at a time

# Reading: smart_open infers gzip decompression from the ".gz" extension.
f_in = open("s3://mybucket/2m_sales_records.csv.gz", transport_params=dict(session=session), encoding="utf-8")
# Writing: the output is buffered and sent to S3 as a multipart upload.
f_out = open("s3://mybucket/2m_sales_records.csv", "w", transport_params=dict(session=session))

byte_count = 0
while True:
    data = f_in.read(chunk_size)
    if not data:  # EOF
        break
    f_out.write(data)
    byte_count += len(data)  # note: in text mode this counts characters, not raw bytes
    print(f"wrote {byte_count} bytes so far")
f_in.close()
f_out.close()
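Note that there is no explicit gzip handling above: smart_open infers the compression from the ".gz" extension when reading. If your source key doesn't end in ".gz", you can force decompression explicitly (assuming a smart_open version that exposes the compression keyword, 5.1+ I believe):

f_in = open(
    "s3://mybucket/2m_sales_records.csv.gz",
    "r",
    compression=".gz",  # force gzip decompression regardless of the extension
    transport_params=dict(session=session),
    encoding="utf-8",
)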

The sample file has 2 million lines; it's 75 MB compressed and 238 MB uncompressed.
I uploaded the compressed file to mybucket and ran the code, which downloaded the file, decompressed the contents in memory as they streamed through, and uploaded the uncompressed data back to S3.
On my computer the process took around 78 seconds (highly dependent on Internet connection speed) and never used more than 95 MB of memory; I think you can lower the memory requirement further, if need be, by overriding the part size smart_open uses for S3 multipart uploads. These are the relevant defaults from smart_open's S3 source:

DEFAULT_MIN_PART_SIZE = 50 * 1024**2
"""Default minimum part size for S3 multipart uploads"""
MIN_MIN_PART_SIZE = 5 * 1024 ** 2
"""The absolute minimum permitted by Amazon."""
answered Oct 11 '22 by Ionut Ticus