For example, pandas's read_csv has a chunksize argument, which makes read_csv return an iterator over the CSV file so we can read it in chunks. The Parquet format stores data in chunks as well, but there isn't a documented way to read it in chunks the way read_csv does.
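For reference, this is what the chunksize pattern looks like with read_csv (the file name and the process function are placeholders):

import pandas as pd

# chunksize turns read_csv into an iterator of DataFrames
for chunk in pd.read_csv("data.csv", chunksize=100_000):
    process(chunk)  # placeholder for whatever you do per chunk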
Is there a way to read parquet files in chunks?
We can always read a Parquet file into a DataFrame in Spark and inspect its contents. Parquet is a columnar format, better suited to analytical, write-once-read-many workloads, which makes it a good fit for read-intensive applications.
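As a rough sketch of the Spark route (the SparkSession setup and file name here are illustrative, not from the original post):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-preview").getOrCreate()

df = spark.read.parquet("dat.parquet")  # load the Parquet data into a Spark DataFrame
df.show(5)                              # peek at the first few rows
df.printSchema()                        # column names and types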
Parquet is built to support flexible compression options and efficient encoding schemes. Because every value in a column has the same data type, compressing each column is straightforward and effective (which also makes queries faster).
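If it helps, pandas exposes the codec choice through to_parquet; the DataFrame and file names below are made up for illustration:

import pandas as pd

df = pd.DataFrame({"id": range(1000), "value": [i * 0.5 for i in range(1000)]})

# snappy is fast (and the usual default); gzip trades speed for smaller files
df.to_parquet("data_snappy.parquet", compression="snappy")
df.to_parquet("data_gzip.parquet", compression="gzip")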
You can read and query Parquet files in pandas much the same way you read CSV files.
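A minimal comparison, assuming files named data.csv and data.parquet with columns id and value (all illustrative):

import pandas as pd

df_csv = pd.read_csv("data.csv")
df_parquet = pd.read_parquet("data.parquet")

# the columnar layout also lets you load just the columns you need
subset = pd.read_parquet("data.parquet", columns=["id", "value"])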
If your parquet file was not created with multiple row groups, the read_row_group method doesn't help much (there is only one group!).
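If the file does contain several row groups, pyarrow can read them one at a time; iter_batches (not mentioned above, but also part of pyarrow's ParquetFile API) yields fixed-size chunks regardless of how the file was written. The file name and process function are placeholders:

import pyarrow.parquet as pq

pf = pq.ParquetFile("dat.parquet")
print(pf.num_row_groups)  # how many row groups the writer produced

# read one row group (chunk) at a time
for i in range(pf.num_row_groups):
    chunk = pf.read_row_group(i).to_pandas()
    process(chunk)  # placeholder for per-chunk work

# or iterate in fixed-size record batches
for batch in pf.iter_batches(batch_size=100_000):
    process(batch.to_pandas())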
However, if your parquet file is partitioned as a directory of parquet files, you can use the fastparquet engine, which only works on individual files, to read each file and then either concatenate them in pandas or get the values and concatenate the ndarrays (sketched after the code below).
import pandas as pd
from glob import glob

# collect the individual partition files written under the dataset directory
files = sorted(glob('dat.parquet/part*'))

# read the first partition, then append the rest one at a time
data = pd.read_parquet(files[0], engine='fastparquet')
for f in files[1:]:
    data = pd.concat([data, pd.read_parquet(f, engine='fastparquet')])
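The other option mentioned above, concatenating the underlying ndarrays instead of the DataFrames, looks roughly like this (column order and dtypes must match across partitions):

import numpy as np
import pandas as pd
from glob import glob

files = sorted(glob('dat.parquet/part*'))
arrays = [pd.read_parquet(f, engine='fastparquet').values for f in files]
data = np.concatenate(arrays)  # stacks the partitions row-wise into one ndarray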