I'm trying to read a parquet dataset from S3 in Python using pyarrow. The S3 UI says that the size of this path is 14.3 GB, with 836 objects total. I'm running the code on a c4.8xlarge EC2 instance, which has 64GB of RAM. Despite having more than 4x as much RAM as the dataset's size, my machine runs out of memory and the program crashes.
Why does reading this dataset require so much memory? Is there a way to avoid this problem? I am aware of distributed computing libraries like Spark and Dask, and am able to make use of this dataset just fine in PySpark, but I'm trying to set up a single-machine workflow.
Here is the code I used to read the dataset:
import pyarrow.parquet as pq
from pyarrow import fs
s3 = fs.S3FileSystem()
bucket = "<bucket_name>"
path = "<path>"
dataset = pq.ParquetDataset(f"{bucket}/{path}", filesystem=s3)
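Constructing the ParquetDataset only builds a description of the files; memory is consumed once the table is materialized with a read call along these lines (the column names are placeholders for the 9 columns I actually read):

table = dataset.read(columns=["col_1", "col_2"])  # decodes the selected columns into one in-memory Arrow table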
And here is a summary of the schema + some stats. I'm reading 9 columns out of 113, and there are 7,045,204 rows:
Column 1: int
Column 2: Array<int>, average len around 450
Column 3: Array<int>, average len around 450
Column 4: Array<int>, average len around 1000
Column 5: Array<int>, average len around 1000
Column 6: String, average len of 2
Column 7: int
Column 8: int
Column 9: timestamp
The answer to the question "why is the loaded parquet bigger than on disk?" is compression, as @michael-delgado explained in the comments: Parquet data is compressed and encoded on disk, while the in-memory Arrow representation is fully decoded and uncompressed, so it can easily be several times larger than the file size reported by S3.
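A rough back-of-the-envelope estimate from the schema stats above shows why 64 GB cannot hold the decoded data (assuming 8-byte integers in the list columns; with 4-byte integers the numbers halve, but the conclusion is the same):

rows = 7_045_204
bytes_per_int = 8                             # assuming 64-bit integers

cols_2_3 = 2 * rows * 450 * bytes_per_int     # two list<int> columns, ~450 values per row  -> ~51 GB
cols_4_5 = 2 * rows * 1000 * bytes_per_int    # two list<int> columns, ~1000 values per row -> ~113 GB
print((cols_2_3 + cols_4_5) / 1e9)            # ~163 GB of raw values, before offsets and the other columns

Even at 4 bytes per integer that is roughly 82 GB for the four array columns alone, so the uncompressed table simply does not fit in RAM; the fix is to avoid materializing it all at once.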
A workaround for your situation is to use the Arrow dataset API, either via pyarrow.dataset.dataset or by passing use_legacy_dataset=False to ParquetDataset. The dataset API scans the data lazily, so you can select only the columns you need and process the rows in batches instead of loading one giant table; see the pyarrow dataset documentation for more detail.
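As a minimal sketch of that workaround (the bucket, path, column names, and batch size below are placeholders), the dataset API lets you project just the columns you need and stream the scan as record batches, so only a bounded amount of data is in memory at any time:

import pyarrow.dataset as ds
from pyarrow import fs

s3 = fs.S3FileSystem()
dataset = ds.dataset("<bucket_name>/<path>", format="parquet", filesystem=s3)

# Iterate over the scan in record batches instead of materializing one huge table
for batch in dataset.to_batches(columns=["col_1", "col_2"], batch_size=65_536):
    ...  # process each pyarrow.RecordBatch here, e.g. aggregate or convert to pandas

Peak memory is then governed by the batch size and the projected columns rather than by the full 7 million rows.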