I have data in Parquet format which is too big to fit into memory (6 GB). I am looking for a way to read and process the file using Python 3.6. Is there a way to stream the file, down-sample it, and save the result to a dataframe? Ultimately, I would like to have the data in dataframe format to work with.
Am I wrong to attempt this without using Spark?
I have tried using pyarrow and fastparquet, but I get memory errors when trying to read the entire file in.
Any tips or suggestions would be greatly appreciated!
Spark is certainly a viable choice for this task.
We're planning to add streaming read logic in pyarrow this year (2019; see https://issues.apache.org/jira/browse/ARROW-3771 and related issues). In the meantime, I would recommend reading one row group at a time to mitigate the memory issues. You can do this with pyarrow.parquet.ParquetFile and its read_row_group method.
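A minimal sketch of that approach, assuming a local file named data.parquet and an illustrative 10% down-sampling rate (both are placeholders, not from the original question):

```python
import pyarrow.parquet as pq
import pandas as pd

# Hypothetical path; substitute your own file.
pf = pq.ParquetFile("data.parquet")

frames = []
for i in range(pf.num_row_groups):
    # Read one row group at a time so only a fraction of the
    # file is held in memory at once.
    table = pf.read_row_group(i)
    df = table.to_pandas()
    # Down-sample each chunk; 10% is just an example fraction.
    frames.append(df.sample(frac=0.1, random_state=0))

# Combine the down-sampled chunks into a single dataframe.
result = pd.concat(frames, ignore_index=True)
```

The peak memory use is roughly one row group plus the accumulated samples, so how well this works depends on how the file's row groups were sized when it was written.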