I have data in Parquet format which is too big to fit into memory (6 GB). I am looking for a way to read and process the file using Python 3.6. Is there a way to stream the file, down-sample it, and save the result to a dataframe? Ultimately, I would like to have the data in dataframe format to work with.
Am I wrong to attempt this without using Spark?
I have tried using pyarrow and fastparquet, but I get memory errors when trying to read the entire file in.
Any tips or suggestions would be greatly appreciated!
Spark is certainly a viable choice for this task.
We're planning to add streaming read logic in pyarrow this year (2019; see https://issues.apache.org/jira/browse/ARROW-3771 and related issues). In the meantime, I would recommend reading one row group at a time to mitigate the memory issues. You can do this with pyarrow.parquet.ParquetFile and its read_row_group method.
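A minimal sketch of that approach, assuming a local file named data.parquet and an illustrative 10% down-sampling rate (both are placeholders, not from the original question):

```python
import pyarrow.parquet as pq
import pandas as pd

# Hypothetical path; substitute your own file.
pf = pq.ParquetFile("data.parquet")

frames = []
for i in range(pf.num_row_groups):
    # Read one row group at a time so only a fraction of the
    # file is held in memory at once.
    table = pf.read_row_group(i)
    df = table.to_pandas()
    # Down-sample each chunk; 10% is just an example fraction.
    frames.append(df.sample(frac=0.1, random_state=0))

# Combine the down-sampled chunks into a single dataframe.
result = pd.concat(frames, ignore_index=True)
```

The peak memory use is roughly one row group plus the accumulated samples, so how well this works depends on how the file's row groups were sized when it was written.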