I am trying to figure out the fastest way to write a LARGE pandas DataFrame to S3. I am currently trying two ways:
1) Through gzip compression (BytesIO) and boto3
import gzip
from io import BytesIO, TextIOWrapper
import boto3

# compress the CSV in memory, then upload the whole payload in one put
gz_buffer = BytesIO()
with gzip.GzipFile(mode='w', fileobj=gz_buffer) as gz_file:
    df.to_csv(TextIOWrapper(gz_file, 'utf8'), index=False)
s3_resource = boto3.resource('s3')
s3_object = s3_resource.Object(bucket, s3_path + name_zip)
s3_object.put(Body=gz_buffer.getvalue())
which, for a DataFrame of 7M rows, takes around 420 seconds to write to S3.
2) Through writing to a CSV file without compression (StringIO buffer)
from io import StringIO

csv_buffer = StringIO()
df.to_csv(csv_buffer)
s3_resource = boto3.resource('s3')
s3_resource.Object(bucket, s3_path + name_csv).put(Body=csv_buffer.getvalue())
which takes around 371 seconds...
The question is: is there a faster way to write a pandas DataFrame to S3?
Use multi-part uploads to make the transfer to S3 faster. Compression makes the file smaller, so that will help too.
import boto3
from io import BytesIO

s3 = boto3.client('s3')

csv_buffer = BytesIO()
# writing gzip-compressed output to a binary buffer needs pandas >= 1.2
df.to_csv(csv_buffer, compression='gzip')
csv_buffer.seek(0)  # rewind so upload_fileobj reads from the start

# multipart upload; use boto3.s3.transfer.TransferConfig to tune part size or other settings
s3.upload_fileobj(csv_buffer, bucket, key)
The docs for s3.upload_fileobj are here: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#S3.Client.upload_fileobj
It really depends on the content, but that is not related to boto3. Try first to dump your DataFrame locally and see what's fastest and what size you get.
Here are some suggestions that we have found to be fast, for cases between a few MB and over 2 GB (although, for more than 2 GB, you really want Parquet, and possibly to split it into a Parquet dataset); a quick timing sketch follows the list:
- Lots of mixed text/numerical data (SQL-oriented content): use df.to_parquet(file).
- Mostly numerical data (e.g. if your columns' df.dtypes indicate a happy numpy array of a single type, not object): you can try df.to_hdf(file, 'key').
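If you want to compare these locally first (as suggested above), here is a minimal timing sketch, assuming df is already in memory, pyarrow is installed for Parquet and PyTables for HDF5; the file names are just placeholders:

import os
import time

def timed_dump(label, path, dump):
    # time one local dump and report the resulting file size
    start = time.time()
    dump(path)
    print(f'{label}: {time.time() - start:.1f} s, {os.path.getsize(path) / 1e6:.0f} MB')

timed_dump('csv.gz',  'test.csv.gz',  lambda p: df.to_csv(p, index=False, compression='gzip'))
timed_dump('parquet', 'test.parquet', lambda p: df.to_parquet(p))
timed_dump('hdf5',    'test.h5',      lambda p: df.to_hdf(p, 'df'))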
One bit of advice: try to split your df into shards that are meaningful to you (e.g., by time for time series). Especially if you have a lot of updates to a single shard (e.g. the last one in a time series), it will make your download/upload much faster.
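To make the sharding idea concrete, here is a minimal sketch, assuming the DataFrame has a DatetimeIndex and reusing the bucket and s3_path names from the question; it writes one Parquet object per month, so only the shard that changed needs to be re-uploaded:

import boto3

s3 = boto3.client('s3')

# one Parquet object per monthly shard
for period, shard in df.groupby(df.index.to_period('M')):
    local_path = f'shard_{period}.parquet'
    shard.to_parquet(local_path)
    s3.upload_file(local_path, bucket, f'{s3_path}shard_{period}.parquet')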
What we have found is that HDF5 files are bulkier (uncompressed), but they save/load fantastically fast from/into memory. Parquet files are snappy-compressed by default, so they tend to be smaller (depending on the entropy of your data, of course; penalty for you if you save totally random numbers).
For the boto3 client, both multipart_chunksize and multipart_threshold are 8 MB by default, which is often a fine choice. You can check via:

from boto3.s3.transfer import TransferConfig

tc = TransferConfig()
print(f'chunksize: {tc.multipart_chunksize}, threshold: {tc.multipart_threshold}')

Also, the default is to use 10 threads for each upload (which does nothing unless the size of your object is larger than the threshold above).
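If you do want to tune them, the config can be passed straight into the upload call; a minimal sketch, where the part size, thread count, and file/key names are only illustrative:

import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client('s3')

# illustrative values: 64 MB parts, 20 concurrent threads
config = TransferConfig(multipart_threshold=64 * 1024 * 1024,
                        multipart_chunksize=64 * 1024 * 1024,
                        max_concurrency=20)
s3.upload_file('test.csv.gz', bucket, key, Config=config)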
Another question is how to upload many files efficiently; that is not something TransferConfig handles. But I digress; the original question is about a single object.
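For completeness, one simple approach for the many-files case is a thread pool around upload_file; a sketch, assuming local_paths is a list of files already on disk and key_prefix is a hypothetical key prefix:

from concurrent.futures import ThreadPoolExecutor
import boto3

s3 = boto3.client('s3')  # boto3 clients are thread-safe

def upload_one(path):
    # key_prefix is hypothetical; adjust the key layout to your needs
    s3.upload_file(path, bucket, key_prefix + path)

# each individual upload still gets boto3's own multipart handling
with ThreadPoolExecutor(max_workers=8) as pool:
    list(pool.map(upload_one, local_paths))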
You can try using s3fs with pandas' compression to upload to S3. StringIO or BytesIO buffers hog memory, since they hold the entire serialized file in RAM.
import s3fs
import pandas as pd

s3 = s3fs.S3FileSystem(anon=False)
df = pd.read_csv("some_large_file")

# open in binary mode so pandas can write the gzip-compressed bytes
# (compression through a file handle needs a binary handle and pandas >= 1.2)
with s3.open('s3://bucket/file.csv.gzip', 'wb') as f:
    df.to_csv(f, compression='gzip')
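As a side note, recent pandas versions can also write to an s3:// path directly (the write is handed off to s3fs under the hood when it is installed), so you do not have to manage the file handle yourself; a minimal sketch with a placeholder path:

import pandas as pd

df = pd.read_csv("some_large_file")
# gzip compression is applied before the bytes are streamed to S3
df.to_csv("s3://bucket/file.csv.gz", index=False, compression="gzip")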