
How to loop over a large Parquet file with generators in Python?

Is it possible to open a Parquet file and iterate over it row by row using generators? The goal is to avoid loading the whole file into memory.

The file contains a pandas DataFrame.

Alpha asked Jun 08 '18 07:06


2 Answers

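You cannot iterate row by row, because that is not how the data is stored: Parquet is a columnar format in which rows are grouped into row groups. You can iterate over the row groups instead, each of which is loaded as a pandas DataFrame: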

from fastparquet import ParquetFile
pf = ParquetFile('myfile.parq')
for df in pf.iter_row_groups():
    pass  # replace with code that processes the sub-DataFrame df
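
If row-level iteration is still wanted, the row-group loop above can be wrapped in a generator, so that only one row group is held in memory at a time. The following is a minimal sketch, not part of fastparquet itself: the helper names `iter_rows` and `iter_parquet_rows` are made up for illustration, and it assumes each row group fits in memory.

```python
from typing import Any, Iterable, Iterator


def iter_rows(frames: Iterable[Any]) -> Iterator[tuple]:
    """Yield rows one at a time from an iterable of DataFrame-like chunks.

    Each chunk only needs an ``itertuples(index=False)`` method, which
    pandas DataFrames provide.
    """
    for df in frames:
        # Rows are produced lazily; the previous chunk can be freed
        # once the loop moves on to the next one.
        yield from df.itertuples(index=False)


def iter_parquet_rows(path: str) -> Iterator[tuple]:
    """Hypothetical helper: yield rows of a Parquet file, row group by row group."""
    # Imported lazily so that iter_rows stays usable without fastparquet.
    from fastparquet import ParquetFile

    pf = ParquetFile(path)
    yield from iter_rows(pf.iter_row_groups())
```

Usage would look like `for row in iter_parquet_rows('myfile.parq'): ...`. Peak memory then depends on the size of the largest row group, which is fixed when the file is written.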

Liana Ziskind answered Oct 08 '22 02:10


You can iterate over the rows using tensorflow_io:

import tensorflow_io as tfio

dataset = tfio.IODataset.from_parquet('myfile.parquet')

for row in dataset.take(3):
    # print the first 3 rows
    print(row)

DanDy answered Oct 08 '22 03:10