
What is the fastest way to save a large pandas DataFrame to S3?

I am trying to figure out the fastest way to write a LARGE pandas DataFrame to S3. I am currently trying two ways:

1) Through gzip compression (BytesIO) and boto3

import gzip
from io import BytesIO, TextIOWrapper

import boto3

gz_buffer = BytesIO()

# compress the CSV in memory, then upload the compressed bytes
with gzip.GzipFile(mode='w', fileobj=gz_buffer) as gz_file:
    df.to_csv(TextIOWrapper(gz_file, 'utf8'), index=False)

s3_resource = boto3.resource('s3')
s3_object = s3_resource.Object(bucket, s3_path + name_zip)
s3_object.put(Body=gz_buffer.getvalue())

which, for a DataFrame of 7M rows, takes around 420 seconds to write to S3.

2) Through writing to csv file without compression (StringIO buffer)

from io import StringIO

import boto3

csv_buffer = StringIO()
df.to_csv(csv_buffer)
s3_resource = boto3.resource('s3')
s3_resource.Object(bucket, s3_path + name_csv).put(Body=csv_buffer.getvalue())

which takes around 371 seconds...

The question is: is there any other, faster way to write a pandas DataFrame to S3?

asked Mar 28 '19 by pmanresa93


3 Answers

Use multi-part uploads to make the transfer to S3 faster. Compression makes the file smaller, so that will help too.

import boto3
from io import BytesIO

s3 = boto3.client('s3')

# writing a compressed CSV to a binary buffer requires pandas >= 1.2
csv_buffer = BytesIO()
df.to_csv(csv_buffer, compression='gzip')
csv_buffer.seek(0)  # rewind so upload_fileobj reads from the start of the buffer

# multipart upload
# use boto3.s3.transfer.TransferConfig if you need to tune part size or other settings
s3.upload_fileobj(csv_buffer, bucket, key)

The docs for s3.upload_fileobj are here: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#S3.Client.upload_fileobj

answered by jesterhazy


It really depends on the content, and this is not really a boto3 issue. Try dumping your DataFrame locally first to see which format is fastest and what file size you get.

Here are some suggestions that we have found to be fast, for anything from a few MB to over 2 GB (although above 2 GB you really want Parquet, possibly split into a partitioned Parquet dataset); a quick local timing sketch follows the list:

  1. Lots of mixed text/numerical data (SQL-oriented content): use df.to_parquet(file).

  2. Mostly numerical data (e.g. if df.dtypes shows a happy numpy array of a single dtype, not object): you can try df.to_hdf(file, 'key').
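
A minimal local timing sketch, assuming a DataFrame df already in memory (pyarrow is needed for Parquet and tables for HDF5; file names here are just placeholders):

import time

def time_dump(label, fn):
    # time a single dump-to-disk call and report it
    start = time.perf_counter()
    fn()
    print(f'{label}: {time.perf_counter() - start:.1f}s')

time_dump('parquet', lambda: df.to_parquet('df.parquet'))          # snappy-compressed by default
time_dump('hdf5', lambda: df.to_hdf('df.h5', key='df', mode='w'))  # uncompressed by default
time_dump('csv.gz', lambda: df.to_csv('df.csv.gz', index=False))   # gzip inferred from the extension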

One piece of advice: try to split your df into shards that are meaningful to you (e.g. by time for a time series). Especially if you have a lot of updates to a single shard (e.g. the last one in a time series), this will make your downloads/uploads much faster; see the sketch below.
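
As a rough illustration (the monthly grouping, file names, bucket and s3_prefix below are made up for the example), splitting a time-indexed DataFrame into monthly Parquet shards might look like:

import boto3

s3 = boto3.client('s3')

# assumes df has a DatetimeIndex; group rows by month and write one Parquet file per shard
for period, shard in df.groupby(df.index.to_period('M')):
    local_file = f'shard_{period}.parquet'
    shard.to_parquet(local_file)
    # re-upload only the shards that actually changed
    s3.upload_file(local_file, bucket, f'{s3_prefix}/shard_{period}.parquet')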

What we have found is that HDF5 files are bulkier (uncompressed), but they save/load fantastically fast from/to memory. Parquet files are snappy-compressed by default, so they tend to be smaller (depending on the entropy of your data, of course; you pay a penalty if you save totally random numbers).

For the boto3 client, both multipart_chunksize and multipart_threshold default to 8 MB, which is often a fine choice. You can check them via:

from boto3.s3.transfer import TransferConfig

tc = TransferConfig()
print(f'chunksize: {tc.multipart_chunksize}, threshold: {tc.multipart_threshold}')

Also, by default each upload uses up to 10 threads (which does nothing unless the size of your object is larger than the threshold above).
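
If you do want to tune these, a minimal sketch (the part size and concurrency values here are arbitrary examples, not recommendations, and the file name and bucket are placeholders):

import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client('s3')

# raise the part size to 64 MB and allow 20 threads per upload
config = TransferConfig(multipart_chunksize=64 * 1024 * 1024, max_concurrency=20)
s3.upload_file('df.parquet', bucket, 'path/to/df.parquet', Config=config)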

Another question is how to upload many files efficiently; that is not handled by any setting in TransferConfig. But I digress, since the original question is about a single object.

answered by Pierre D


You can try using s3fs with pandas' compression support to upload to S3. StringIO and BytesIO buffers hold the whole file in memory, which is wasteful.

import pandas as pd
import s3fs

s3 = s3fs.S3FileSystem(anon=False)
df = pd.read_csv("some_large_file")
# open in binary mode; pandas >= 1.2 is needed to gzip-compress when writing to a file handle
with s3.open('s3://bucket/file.csv.gz', 'wb') as f:
    df.to_csv(f, compression='gzip')

answered by raj