How do I stream Parquet using pyarrow?

I'm trying to read a large dataset of Parquet files piece by piece, perform some operation on each piece, and then move on to the next without holding everything in memory. I need to do this because the entire dataset doesn't fit in memory. Previously I used ParquetDataset, and I'm aware of RecordBatchStreamReader, but I'm not sure how to combine them.

How can I use pyarrow to do this?

Asked by tSchema

1 Answer

At the moment, the Parquet APIs only support complete reads of individual files, so reads can only be limited at the granularity of a single file. We would like to create an implementation of arrow::RecordBatchReader (the streaming data interface) that reads from Parquet files; see https://issues.apache.org/jira/browse/ARROW-1012. Patches would be welcome.
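
Until such a streaming reader exists, a practical workaround consistent with the per-file limitation described above is to iterate over the dataset's files and load one at a time. The sketch below (not part of the original answer) assumes the Parquet files live in a hypothetical "my_dataset" directory and that a placeholder process() function stands in for your per-piece operation:

    import pyarrow.parquet as pq
    from pathlib import Path

    def process(table):
        # Hypothetical per-file operation; replace with your own logic.
        print(table.num_rows)

    # Read the dataset one file at a time so that only a single file's
    # worth of data is materialized in memory at any moment.
    for path in sorted(Path("my_dataset").glob("*.parquet")):
        table = pq.read_table(str(path))  # loads just this file into an Arrow Table
        process(table)
        del table                         # drop the reference before moving on

This keeps memory usage bounded by the largest individual file, which matches the current granularity limit: you cannot stream record batches out of a single Parquet file, but you can avoid loading the whole dataset at once.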

Answered by Wes McKinney
