
s3.upload_fileobj gives error a bytes-like object is required

My question is inspired by a previous SO question on this topic: uploading and saving DataFrames as CSV files in Amazon Web Services (AWS) S3. Using Python 3, I would like to use s3.upload_fileobj (which supports multipart uploads) to make the data transfer to S3 faster. When I run the code from the accepted answer, I receive the error message "TypeError: a bytes-like object is required, not 'str'".

The answer has recently been upvoted several times, so I think there must be a way to run this code without error in Python 3.

Please find the code below. For simplicity, let's use a small DataFrame. In reality the DataFrame is much bigger (around 500 MB).

import io

import boto3
import pandas as pd

df = pd.DataFrame({'A':[1,2,3], 'B':[6,7,8]})

The code is the following. For convenience, I wrapped it in a function:

def upload_file(dataframe, bucket, key):
    """dataframe=DataFrame, bucket=bucket name in AWS S3, key=key name in AWS S3"""
    s3 = boto3.client('s3')
    csv_buffer = io.BytesIO()
    dataframe.to_csv(csv_buffer, compression='gzip')
    s3.upload_fileobj(csv_buffer, bucket, key)

upload_file(df, 'your-bucket', 'your-key')
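
The reported TypeError can be reproduced without pandas or S3. The sketch below assumes the cause is that text (str) ends up being written into an io.BytesIO, which accepts only bytes. Whether to_csv does this depends on the pandas version in use, so treat it as an illustration rather than a diagnosis.

```python
import io

buffer = io.BytesIO()

# Writing a str into a binary buffer raises the same TypeError as in the question
try:
    buffer.write("A,B\n1,6\n")
except TypeError as exc:
    message = str(exc)
    print(message)

# Encoding the text first yields bytes, which BytesIO accepts
written = buffer.write("A,B\n1,6\n".encode("utf-8"))
```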

Thank you very much for your advice!

Ruthger Righart asked Jun 19 '19 07:06


1 Answer

Going off this reference, it seems you'll need to wrap a gzip.GzipFile object around your BytesIO, which will then perform the compression for you. GzipFile expects bytes, so the CSV text is encoded before it is written.

import io
import gzip

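buffer = io.BytesIO()
with gzip.GzipFile(fileobj=buffer, mode="wb") as f:
    # to_csv() with no path returns a str; encode it to bytes for GzipFile
    f.write(df.to_csv().encode())
buffer.seek(0)  # rewind so the upload reads from the start of the buffer

# df, s3, bucket and key are defined as in the question
s3.upload_fileobj(buffer, bucket, key)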
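
The same steps could be wrapped in a reusable function. This is a minimal sketch, not the answerer's code: the names gzip_csv_to_buffer and upload_csv_gzip are hypothetical, and the S3 client is passed in as an argument (assumed to be something like boto3.client('s3')) so the compression logic can be exercised without AWS access.

```python
import gzip
import io


def gzip_csv_to_buffer(csv_text):
    """Return a BytesIO holding gzip-compressed csv_text, rewound to the start."""
    buffer = io.BytesIO()
    # Closing the GzipFile flushes the gzip trailer but leaves `buffer` open
    with gzip.GzipFile(fileobj=buffer, mode="wb") as f:
        f.write(csv_text.encode("utf-8"))
    buffer.seek(0)  # the upload reads from the current position
    return buffer


def upload_csv_gzip(s3_client, csv_text, bucket, key):
    """Compress csv_text and hand it to s3_client.upload_fileobj.

    s3_client is assumed to expose upload_fileobj(fileobj, bucket, key),
    as a boto3 S3 client does.
    """
    buffer = gzip_csv_to_buffer(csv_text)
    s3_client.upload_fileobj(buffer, bucket, key)
```

With boto3 and pandas available, a call might look like upload_csv_gzip(boto3.client('s3'), df.to_csv(), 'your-bucket', 'your-key'). Note that df.to_csv() builds the whole CSV as a string in memory first, which may matter for a DataFrame of around 500 MB.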

Minimal Verifiable Example

import io
import gzip
import zlib

import pandas as pd

# Encode
df = pd.DataFrame({'A':[1,2,3], 'B':[6,7,8]})

buffer = io.BytesIO()     
with gzip.GzipFile(fileobj=buffer, mode="wb") as f:
    f.write(df.to_csv().encode())

buffer.getvalue()
# b'\x1f\x8b\x08\x00\xf0\x0b\x11]\x02\xff\xd3q\xd4q\xe22\xd01\xd41\xe32\xd41\xd21\xe72\xd21\xd6\xb1\xe0\x02\x00Td\xc2\xf5\x17\x00\x00\x00'

# Decode
print(zlib.decompress(buffer.getvalue(), 16+zlib.MAX_WBITS).decode())

# ,A,B
# 0,1,6
# 1,2,7
# 2,3,8
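
The decode step above passes 16 + zlib.MAX_WBITS so that zlib expects a gzip header and trailer. As a hedged aside, the standard library's gzip.decompress should give the same result with less ceremony. The sketch below uses a literal string that mirrors the CSV output shown above in place of df.to_csv(), so it runs without pandas.

```python
import gzip
import io
import zlib

# Stand-in for df.to_csv(); mirrors the decoded output shown above
csv_text = ",A,B\n0,1,6\n1,2,7\n2,3,8\n"

buffer = io.BytesIO()
with gzip.GzipFile(fileobj=buffer, mode="wb") as f:
    f.write(csv_text.encode())

compressed = buffer.getvalue()

# Two ways to decode the same gzip stream
via_zlib = zlib.decompress(compressed, 16 + zlib.MAX_WBITS).decode()
via_gzip = gzip.decompress(compressed).decode()

print(via_zlib == via_gzip == csv_text)
```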

cs95 answered Oct 13 '22 13:10