Pandas dataframe to parquet buffer in memory

The use case is the following:

  1. Read data from external database and load it into pandas dataframe
  2. Transform that dataframe into parquet format buffer
  3. Upload that buffer to s3

I've been trying to do step two in memory (without having to write the file to disk just to get it into the Parquet format), but every library I've looked at so far writes to disk.
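For context, the disk-based version of the pipeline looks roughly like the sketch below (the connection string, table name, temporary path, bucket and key are all placeholders):

import pandas as pd
import boto3
from sqlalchemy import create_engine

# 1. read from the external database (connection string is a placeholder)
engine = create_engine('postgresql://user:password@db-host:5432/mydb')
df = pd.read_sql('SELECT * FROM my_table', engine)

# 2. convert to parquet by writing a temporary file to disk (the step I want to keep in memory)
df.to_parquet('/tmp/my_table.parquet', engine='pyarrow', index=False)

# 3. upload the temporary file to S3
s3 = boto3.client('s3')
s3.upload_file('/tmp/my_table.parquet', 'my-bucket', 'exports/my_table.parquet')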

So I have the following questions:

  • Wouldn't it be more performant to do the conversion in memory, since you avoid the disk I/O overhead?
  • As the number of concurrent processes converting files and writing them to disk grows, couldn't we run into disk problems, such as running out of space or hitting the disk's throughput limit?
asked Oct 23 '18 by JaviOverflow


1 Answer

Apache Arrow and the pyarrow library should solve this; most of the processing happens in memory. In pandas you can read and write Parquet files via pyarrow.

Some example code that also leverages smart_open:

import pandas as pd
import boto3
from smart_open import open
from io import BytesIO

s3 = boto3.client('s3')

# read an existing parquet object from S3 into a dataframe
# (bucket and key are placeholders for your own values)
obj = s3.get_object(Bucket=bucket, Key=key)
df = pd.read_parquet(BytesIO(obj['Body'].read()), engine='pyarrow')

# do stuff with dataframe

# write the parquet file to s3 straight from memory (no local file)
# (outputBucket, outputPrefix and additionalSuffix are placeholders)
with open(f's3://{outputBucket}/{outputPrefix}{additionalSuffix}', 'wb') as out_file:
    df.to_parquet(out_file, engine='pyarrow', index=False)
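
If you prefer to keep the buffer explicit instead of going through smart_open, the same write can be done with a BytesIO object and a plain boto3 upload. A minimal sketch, assuming the dataframe df from above; the output bucket and key are placeholders:

import boto3
from io import BytesIO

buffer = BytesIO()
df.to_parquet(buffer, engine='pyarrow', index=False)  # parquet bytes stay in memory
buffer.seek(0)  # rewind the buffer before handing it to boto3

s3 = boto3.client('s3')
s3.upload_fileobj(buffer, 'my-output-bucket', 'exports/my_table.parquet')

Nothing touches the local disk here, which also addresses the space and throughput concerns raised in the question.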

answered Oct 27 '22 by JD D