Why does reading a parquet dataset require much more memory than the size of the dataset?

I'm trying to read a parquet dataset from S3 in Python using pyarrow. The S3 UI says that the size of this path is 14.3 GB, with 836 objects total. I'm running the code on a c4.8xlarge EC2 instance, which has 64GB of RAM. Despite having more than 4x as much RAM as the dataset's size, my machine runs out of memory and the program crashes.

Why does reading this dataset require so much memory? Is there a way to avoid this problem? I am aware of distributed computing libraries like Spark and Dask, and am able to make use of this dataset just fine in PySpark, but I'm trying to set up a single-machine workflow.

Here is the code I used to read the dataset:

import pyarrow.parquet as pq
from pyarrow import fs

s3 = fs.S3FileSystem()

#fs = s3fs.S3FileSystem()
bucket = "<bucket_name>"
path = "<path>"

# Build the dataset from all parquet files under the prefix
dataset = pq.ParquetDataset(f"{bucket}/{path}", filesystem=s3)
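Constructing the ParquetDataset by itself only collects file metadata; the out-of-memory crash presumably happens once the table is actually materialized, with something along these lines (the calls below are an assumption, not part of my original code):

table = dataset.read()   # loads all row groups of all 836 objects into memory
df = table.to_pandas()   # converting to pandas needs additional memory on top of the Arrow table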

And here is a summary of the schema + some stats. I'm reading 9 columns out of 113, and there are 7,045,204 rows:

Column 1: int
Column 2: Array<int>, average len around 450
Column 3: Array<int>, average len around 450
Column 4: Array<int>, average len around 1000
Column 5: Array<int>, average len around 1000
Column 6: String, average len of 2
Column 7: int
Column 8: int
Column 9: timestamp
asked Oct 20 '25 by user12138762
1 Answer

The answer to the question "Why is the loaded parquet bigger than on disk?" is compression, as @michael-delgado explained in the comments: the 14.3 GB reported by S3 is the compressed, encoded size on disk, while the in-memory Arrow representation is fully decoded.
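A quick back-of-envelope check with the row count and average array lengths from your question shows how large the decompressed columns get (this assumes 8-byte integers in memory; the real element width depends on your schema):

rows = 7_045_204
avg_array_lens = [450, 450, 1000, 1000]   # Columns 2-5
int_bytes = 8                             # assumption: 64-bit integers

array_bytes = rows * sum(avg_array_lens) * int_bytes
scalar_bytes = rows * 4 * int_bytes       # Columns 1, 7, 8 and the timestamp
string_bytes = rows * 2                   # Column 6, ~2 characters each

print((array_bytes + scalar_bytes + string_bytes) / 1024**3)  # on the order of 150 GB

Even if the integers were only 4 bytes each, that would still be roughly 75 GB, which is more RAM than the instance has, so the compressed 14.3 GB on disk is not a useful guide to peak memory.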

A workaround for your situation is to use the Arrow Dataset API, either via pyarrow.dataset.dataset or by setting use_legacy_dataset=False if you want to keep using ParquetDataset. There is more detailed information in the pyarrow documentation.
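A minimal sketch of that approach, with placeholder column names (swap in the 9 columns you actually need):

import pyarrow.dataset as ds
from pyarrow import fs

s3 = fs.S3FileSystem()

# The dataset API scans files lazily, so you can project columns and
# stream record batches instead of materializing the whole table at once.
dataset = ds.dataset("<bucket_name>/<path>", filesystem=s3, format="parquet")

columns = ["col_1", "col_2", "col_9"]   # placeholder names

for batch in dataset.to_batches(columns=columns):
    ...                                 # process each batch, then let it go out of scope

If you do need everything in memory at once, dataset.to_table(columns=columns) at least avoids reading the 104 columns you don't use.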

answered Oct 22 '25 by assignUser