The use case is the following:
I've been trying to do step two in memory (without having to write the file to disk just to get it into parquet format), but every library I've seen so far writes to disk.
So I have the following questions:
Apache Arrow and the pyarrow library should solve this; they do much of the processing in memory. In pandas you can read and write parquet files via pyarrow.
Here is some example code that also leverages smart_open.
import pandas as pd
import boto3
from smart_open import open
from io import BytesIO

# bucket, key, outputBucket, outputPrefix and additionalSuffix are placeholders for your own values
s3 = boto3.client('s3')

# read the parquet file from S3 into memory
obj = s3.get_object(Bucket=bucket, Key=key)
df = pd.read_parquet(BytesIO(obj['Body'].read()), engine='pyarrow')

# do stuff with the dataframe

# write the parquet file back to S3 straight from memory
with open(f's3://{outputBucket}/{outputPrefix}{additionalSuffix}', 'wb') as out_file:
    df.to_parquet(out_file, engine='pyarrow', index=False)
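If you prefer not to add the smart_open dependency, the same round trip should work with boto3 alone: serialize the dataframe into an in-memory buffer and upload it with put_object. This is a minimal sketch, assuming the same placeholder bucket and key names as above.

import pandas as pd
import boto3
from io import BytesIO

s3 = boto3.client('s3')

# read the parquet file from S3 into memory
obj = s3.get_object(Bucket=bucket, Key=key)
df = pd.read_parquet(BytesIO(obj['Body'].read()), engine='pyarrow')

# serialize the dataframe to parquet bytes in an in-memory buffer
buffer = BytesIO()
df.to_parquet(buffer, engine='pyarrow', index=False)

# upload the buffer contents to S3 without touching disk
s3.put_object(Bucket=outputBucket, Key=f'{outputPrefix}{additionalSuffix}', Body=buffer.getvalue())

Since df.to_parquet accepts any file-like object, the parquet data never has to hit disk in either variant.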