
How to open huge parquet file using Pandas without enough RAM

I am trying to read a decently large Parquet file (~2 GB, about ~30 million rows) into my Jupyter Notebook (in Python 3) using the Pandas read_parquet function. I have also installed the pyarrow and fastparquet libraries, which the read_parquet function uses as the engine for parquet files. Unfortunately, while reading, my computer freezes and I eventually get an error saying it ran out of memory (I don't want to re-run the code since that will cause another freeze, so I don't have the verbatim error message).

Is there a good way to read only part of the parquet file into memory without this happening? I know that parquet files are columnar, so it may not be possible to load only a subset of the records into memory, but I'd like to split the file up if there is a workaround, or to find out whether I am doing something wrong while trying to read it in.

I do have a relatively weak computer in terms of specs, with only 6 GB of memory and an i3 CPU at 2.2 GHz (with Turbo Boost available).

asked Feb 11 '20 by qxzsilver

1 Answer

Do you need all the columns? You might be able to save memory by just loading the ones you actually use.
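For example, something like this (the file path and column names below are placeholders for your own):

    import pandas as pd

    # Only the listed columns are read from disk; both the pyarrow and
    # fastparquet engines support this, so memory use scales with the
    # columns you request rather than with the whole file.
    cols = ["user_id", "timestamp", "value"]  # replace with the columns you need
    df = pd.read_parquet("data.parquet", columns=cols)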

A second possibility is to use an online machine (like Google Colab) to load the parquet file and then save it as HDF. Once you have the HDF file, you can read it in chunks.
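A rough sketch of that workflow (file paths, the key name, and the per-chunk processing are placeholders; to_hdf needs the PyTables package installed):

    import pandas as pd

    # Step 1: on a machine with enough RAM (e.g. a Colab session),
    # convert the parquet file once. format="table" is what makes
    # chunked reads possible later.
    df = pd.read_parquet("data.parquet")
    df.to_hdf("data.h5", key="data", format="table")

    # Step 2: on the low-memory machine, iterate over the HDF file
    # in pieces instead of loading it all at once.
    for chunk in pd.read_hdf("data.h5", key="data", chunksize=1_000_000):
        # do your per-chunk processing here, e.g. aggregate and discard
        print(len(chunk))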

answered Sep 18 '22 by Andrea