I am trying to figure out the fastest way to write a LARGE pandas DataFrame to S3. I am currently trying two ways:
1) Through gzip compression (BytesIO) and boto3
import gzip
from io import BytesIO, TextIOWrapper
import boto3

# compress the CSV in memory, then upload the whole payload in one put
gz_buffer = BytesIO()
with gzip.GzipFile(mode='w', fileobj=gz_buffer) as gz_file:
    df.to_csv(TextIOWrapper(gz_file, 'utf8'), index=False)
s3_resource = boto3.resource('s3')
s3_object = s3_resource.Object(bucket, s3_path + name_zip)
s3_object.put(Body=gz_buffer.getvalue())
which, for a DataFrame of 7M rows, takes around 420 seconds to write to S3.
2) Through writing to a CSV file without compression (StringIO buffer)
from io import StringIO

csv_buffer = StringIO()
df.to_csv(csv_buffer)
s3_resource = boto3.resource('s3')
s3_resource.Object(bucket, s3_path + name_csv).put(Body=csv_buffer.getvalue())
which takes around 371 seconds...
The question is: is there a faster way to write a pandas DataFrame to S3?
Use multi-part uploads to make the transfer to S3 faster. Compression makes the file smaller, so that will help too.
import boto3
from io import BytesIO

s3 = boto3.client('s3')

csv_buffer = BytesIO()
# writing gzip-compressed output to a binary buffer needs pandas >= 1.2
df.to_csv(csv_buffer, compression='gzip')
csv_buffer.seek(0)  # rewind so upload_fileobj reads from the start

# multipart upload; use boto3.s3.transfer.TransferConfig to tune part size or other settings
s3.upload_fileobj(csv_buffer, bucket, key)
The docs for s3.upload_fileobj are here: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#S3.Client.upload_fileobj
It really depends on the content, but that is not related to boto3. Try first to dump your DataFrame locally and see what's fastest and what size you get.
Here are some suggestions that we have found to be fast, for cases between a few MB and over 2 GB (although, for more than 2 GB, you really want Parquet, and possibly to split it into a Parquet dataset); a quick timing sketch follows the list:
- Lots of mixed text/numerical data (SQL-oriented content): use df.to_parquet(file).
- Mostly numerical data (e.g. if your columns' df.dtypes indicate a happy numpy array of a single type, not object): you can try df.to_hdf(file, 'key').
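If you want to compare these locally first (as suggested above), here is a minimal timing sketch, assuming df is already in memory, pyarrow is installed for Parquet and PyTables for HDF5; the file names are just placeholders:

import os
import time

def timed_dump(label, path, dump):
    # time one local dump and report the resulting file size
    start = time.time()
    dump(path)
    print(f'{label}: {time.time() - start:.1f} s, {os.path.getsize(path) / 1e6:.0f} MB')

timed_dump('csv.gz',  'test.csv.gz',  lambda p: df.to_csv(p, index=False, compression='gzip'))
timed_dump('parquet', 'test.parquet', lambda p: df.to_parquet(p))
timed_dump('hdf5',    'test.h5',      lambda p: df.to_hdf(p, 'df'))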
One bit of advice: try to split your df into shards that are meaningful to you (e.g., by time for time series). Especially if you have a lot of updates to a single shard (e.g. the last one in a time series), it will make your download/upload much faster.
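To make the sharding idea concrete, here is a minimal sketch, assuming the DataFrame has a DatetimeIndex and reusing the bucket and s3_path names from the question; it writes one Parquet object per month, so only the shard that changed needs to be re-uploaded:

import boto3

s3 = boto3.client('s3')

# one Parquet object per monthly shard
for period, shard in df.groupby(df.index.to_period('M')):
    local_path = f'shard_{period}.parquet'
    shard.to_parquet(local_path)
    s3.upload_file(local_path, bucket, f'{s3_path}shard_{period}.parquet')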
What we have found is that HDF5 files are bulkier (uncompressed), but they save/load fantastically fast from/into memory. Parquet files are snappy-compressed by default, so they tend to be smaller (depending on the entropy of your data, of course; penalty for you if you save totally random numbers).
For the boto3 client, both multipart_chunksize and multipart_threshold are 8 MB by default, which is often a fine choice. You can check via:

from boto3.s3.transfer import TransferConfig

tc = TransferConfig()
print(f'chunksize: {tc.multipart_chunksize}, threshold: {tc.multipart_threshold}')

Also, the default is to use 10 threads for each upload (which does nothing unless the size of your object is larger than the threshold above).
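If you do want to tune them, the config can be passed straight into the upload call; a minimal sketch, where the part size, thread count, and file/key names are only illustrative:

import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client('s3')

# illustrative values: 64 MB parts, 20 concurrent threads
config = TransferConfig(multipart_threshold=64 * 1024 * 1024,
                        multipart_chunksize=64 * 1024 * 1024,
                        max_concurrency=20)
s3.upload_file('test.csv.gz', bucket, key, Config=config)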
Another question is how to upload many files efficiently; that is not something TransferConfig handles. But I digress; the original question is about a single object.
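For completeness, one simple approach for the many-files case is a thread pool around upload_file; a sketch, assuming local_paths is a list of files already on disk and key_prefix is a hypothetical key prefix:

from concurrent.futures import ThreadPoolExecutor
import boto3

s3 = boto3.client('s3')  # boto3 clients are thread-safe

def upload_one(path):
    # key_prefix is hypothetical; adjust the key layout to your needs
    s3.upload_file(path, bucket, key_prefix + path)

# each individual upload still gets boto3's own multipart handling
with ThreadPoolExecutor(max_workers=8) as pool:
    list(pool.map(upload_one, local_paths))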
You can try using s3fs with pandas' compression to upload to S3. StringIO or BytesIO buffers hog memory, since they hold the entire serialized file in RAM.
import s3fs
import pandas as pd

s3 = s3fs.S3FileSystem(anon=False)
df = pd.read_csv("some_large_file")

# open in binary mode so pandas can write the gzip-compressed bytes
# (compression through a file handle needs a binary handle and pandas >= 1.2)
with s3.open('s3://bucket/file.csv.gzip', 'wb') as f:
    df.to_csv(f, compression='gzip')
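As a side note, recent pandas versions can also write to an s3:// path directly (the write is handed off to s3fs under the hood when it is installed), so you do not have to manage the file handle yourself; a minimal sketch with a placeholder path:

import pandas as pd

df = pd.read_csv("some_large_file")
# gzip compression is applied before the bytes are streamed to S3
df.to_csv("s3://bucket/file.csv.gz", index=False, compression="gzip")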