
How to open huge parquet file using Pandas without enough RAM

I am trying to read a decently large Parquet file (~2 GB, about ~30 million rows) into my Jupyter Notebook (in Python 3) using the Pandas read_parquet function. I have also installed the pyarrow and fastparquet libraries, which the read_parquet function uses as the engine for parquet files. Unfortunately, while reading, my computer freezes and I eventually get an error saying it ran out of memory (I don't want to re-run the code since that will cause another freeze, so I don't have the verbatim error message).

Is there a good way to read only part of the parquet file into memory without this happening? I know that parquet files are columnar, so it may not be possible to load only a subset of the records into memory, but I'd like to split the file up if there is a workaround, or to find out whether I am doing something wrong while trying to read it in.

I do have a relatively weak computer in terms of specs, with only 6 GB of memory and an i3 CPU at 2.2 GHz (with Turbo Boost available).

asked Feb 11 '20 by qxzsilver

1 Answer

Do you need all the columns? You might be able to save memory by just loading the ones you actually use.
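For example, something like this (the file path and column names below are placeholders for your own):

    import pandas as pd

    # Only the listed columns are read from disk; both the pyarrow and
    # fastparquet engines support this, so memory use scales with the
    # columns you request rather than with the whole file.
    cols = ["user_id", "timestamp", "value"]  # replace with the columns you need
    df = pd.read_parquet("data.parquet", columns=cols)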

A second possibility is to use an online machine (like Google Colab) to load the parquet file and then save it as HDF. Once you have the HDF file, you can read it in chunks.
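A rough sketch of that workflow (file paths, the key name, and the per-chunk processing are placeholders; to_hdf needs the PyTables package installed):

    import pandas as pd

    # Step 1: on a machine with enough RAM (e.g. a Colab session),
    # convert the parquet file once. format="table" is what makes
    # chunked reads possible later.
    df = pd.read_parquet("data.parquet")
    df.to_hdf("data.h5", key="data", format="table")

    # Step 2: on the low-memory machine, iterate over the HDF file
    # in pieces instead of loading it all at once.
    for chunk in pd.read_hdf("data.h5", key="data", chunksize=1_000_000):
        # do your per-chunk processing here, e.g. aggregate and discard
        print(len(chunk))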

answered Sep 18 '22 by Andrea