I'm trying to read a parquet dataset from S3 in Python using pyarrow. The S3 UI says that the size of this path is 14.3 GB, with 836 objects total. I'm running the code on a c4.8xlarge EC2 instance, which has 64GB of RAM. Despite having more than 4x as much RAM as the dataset's size, my machine runs out of memory and the program crashes.
Why does reading this dataset require so much memory? Is there a way to avoid this problem? I am aware of distributed computing libraries like Spark and Dask, and am able to make use of this dataset just fine in PySpark, but I'm trying to set up a single-machine workflow.
Here is the code I used to read the dataset:
import pyarrow.parquet as pq
from pyarrow import fs
s3 = fs.S3FileSystem()
bucket = "<bucket_name>"
path = "<path>"
dataset = pq.ParquetDataset(f"{bucket}/{path}", filesystem=s3)
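Constructing the ParquetDataset only builds a description of the files; memory is consumed once the table is materialized with a read call along these lines (the column names are placeholders for the 9 columns I actually read):

table = dataset.read(columns=["col_1", "col_2"])  # decodes the selected columns into one in-memory Arrow table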
And here is a summary of the schema + some stats. I'm reading 9 columns out of 113, and there are 7,045,204 rows:
Column 1: int
Column 2: Array<int>, average len around 450
Column 3: Array<int>, average len around 450
Column 4: Array<int>, average len around 1000
Column 5: Array<int>, average len around 1000
Column 6: String, average len of 2
Column 7: int
Column 8: int
Column 9: timestamp
The answer to the question "why is the loaded parquet bigger than on disk?" is compression, as @michael-delgado explained in the comments: Parquet data is compressed and encoded on disk, while the in-memory Arrow representation is fully decoded and uncompressed, so it can easily be several times larger than the file size reported by S3.
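A rough back-of-the-envelope estimate from the schema stats above shows why 64 GB cannot hold the decoded data (assuming 8-byte integers in the list columns; with 4-byte integers the numbers halve, but the conclusion is the same):

rows = 7_045_204
bytes_per_int = 8                             # assuming 64-bit integers

cols_2_3 = 2 * rows * 450 * bytes_per_int     # two list<int> columns, ~450 values per row  -> ~51 GB
cols_4_5 = 2 * rows * 1000 * bytes_per_int    # two list<int> columns, ~1000 values per row -> ~113 GB
print((cols_2_3 + cols_4_5) / 1e9)            # ~163 GB of raw values, before offsets and the other columns

Even at 4 bytes per integer that is roughly 82 GB for the four array columns alone, so the uncompressed table simply does not fit in RAM; the fix is to avoid materializing it all at once.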
A workaround for your situation is to use the Arrow dataset API, either via pyarrow.dataset.dataset or by passing use_legacy_dataset=False to ParquetDataset. The dataset API scans the data lazily, so you can select only the columns you need and process the rows in batches instead of loading one giant table; see the pyarrow dataset documentation for more detail.
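As a minimal sketch of that workaround (the bucket, path, column names, and batch size below are placeholders), the dataset API lets you project just the columns you need and stream the scan as record batches, so only a bounded amount of data is in memory at any time:

import pyarrow.dataset as ds
from pyarrow import fs

s3 = fs.S3FileSystem()
dataset = ds.dataset("<bucket_name>/<path>", format="parquet", filesystem=s3)

# Iterate over the scan in record batches instead of materializing one huge table
for batch in dataset.to_batches(columns=["col_1", "col_2"], batch_size=65_536):
    ...  # process each pyarrow.RecordBatch here, e.g. aggregate or convert to pandas

Peak memory is then governed by the batch size and the projected columns rather than by the full 7 million rows.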