Pandas dataframe to parquet buffer in memory

The use case is the following:

  1. Read data from external database and load it into pandas dataframe
  2. Transform that dataframe into parquet format buffer
  3. Upload that buffer to s3

I've been trying to do step two in memory (without having to write the file to disk just to get it into the Parquet format), but every library I've looked at so far writes to disk.
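For context, the disk-based version of the pipeline looks roughly like the sketch below (the connection string, table name, temporary path, bucket and key are all placeholders):

import pandas as pd
import boto3
from sqlalchemy import create_engine

# 1. read from the external database (connection string is a placeholder)
engine = create_engine('postgresql://user:password@db-host:5432/mydb')
df = pd.read_sql('SELECT * FROM my_table', engine)

# 2. convert to parquet by writing a temporary file to disk (the step I want to keep in memory)
df.to_parquet('/tmp/my_table.parquet', engine='pyarrow', index=False)

# 3. upload the temporary file to S3
s3 = boto3.client('s3')
s3.upload_file('/tmp/my_table.parquet', 'my-bucket', 'exports/my_table.parquet')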

So I have the following questions:

  • Wouldn't it be more performant to do the conversion in memory, since you avoid the disk I/O overhead?
  • As the number of concurrent processes converting files and writing them to disk grows, couldn't we run into disk problems, such as running out of space or hitting the disk's throughput limit?
asked Oct 23 '18 by JaviOverflow


1 Answer

Apache Arrow and the pyarrow library should solve this; most of the processing happens in memory. In pandas you can read and write Parquet files via pyarrow.

Some example code that also leverages smart_open:

import pandas as pd
import boto3
from smart_open import open
from io import BytesIO

s3 = boto3.client('s3')

# read an existing parquet object from S3 into a dataframe
# (bucket and key are placeholders for your own values)
obj = s3.get_object(Bucket=bucket, Key=key)
df = pd.read_parquet(BytesIO(obj['Body'].read()), engine='pyarrow')

# do stuff with dataframe

# write the parquet file to s3 straight from memory (no local file)
# (outputBucket, outputPrefix and additionalSuffix are placeholders)
with open(f's3://{outputBucket}/{outputPrefix}{additionalSuffix}', 'wb') as out_file:
    df.to_parquet(out_file, engine='pyarrow', index=False)
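
If you prefer to keep the buffer explicit instead of going through smart_open, the same write can be done with a BytesIO object and a plain boto3 upload. A minimal sketch, assuming the dataframe df from above; the output bucket and key are placeholders:

import boto3
from io import BytesIO

buffer = BytesIO()
df.to_parquet(buffer, engine='pyarrow', index=False)  # parquet bytes stay in memory
buffer.seek(0)  # rewind the buffer before handing it to boto3

s3 = boto3.client('s3')
s3.upload_fileobj(buffer, 'my-output-bucket', 'exports/my_table.parquet')

Nothing touches the local disk here, which also addresses the space and throughput concerns raised in the question.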

answered Oct 27 '22 by JD D