
Is it possible to read parquet files in chunks?

Tags:

parquet

For example, pandas' read_csv has a chunksize argument that makes read_csv return an iterator over the CSV file, so we can read it in chunks.
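For reference, the read_csv behavior being described looks like this (a minimal sketch using an in-memory CSV):

```python
import io
import pandas as pd

# A small in-memory CSV with 4 data rows.
csv_data = io.StringIO("a,b\n1,2\n3,4\n5,6\n7,8\n")

# chunksize makes read_csv return an iterator of DataFrames
# instead of one DataFrame, here 2 rows per chunk.
reader = pd.read_csv(csv_data, chunksize=2)
chunks = list(reader)  # two DataFrames of 2 rows each
```

Each chunk is an ordinary DataFrame, so the file can be processed piece by piece without loading it all at once.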

The Parquet format stores the data in chunks (row groups), but there isn't a documented way to read it in chunks the way read_csv does.

Is there a way to read parquet files in chunks?

xiaodai asked Nov 29 '19 04:11

People also ask

Can you view Parquet files?

We can always read a parquet file into a DataFrame in Spark and view its contents. Parquet is a columnar format, written once and read many times, which makes it better suited to analytical environments and read-intensive applications.

Can Parquet files be compressed?

Parquet is built to support flexible compression options and efficient encoding schemes. Because all values in a column share the same data type, each column compresses well (which also makes queries faster).

Can you query Parquet files?

You can query Parquet files the same way you read CSV files.


1 Answer

If your parquet file was not created with multiple row groups, the read_row_group method won't help (there is only one group!).

However, if your parquet file is partitioned as a directory of parquet files, you can use the fastparquet engine, which works on individual files, to read each file and then concatenate them in pandas (or take the underlying values and concatenate the ndarrays).

import pandas as pd
from glob import glob

# Collect the individual part files of the partitioned dataset.
files = sorted(glob('dat.parquet/part*'))

# Read each part with fastparquet and concatenate into one DataFrame.
data = pd.concat(
    [pd.read_parquet(f, engine='fastparquet') for f in files]
)
lee answered Oct 22 '22 12:10