
Pandas: Reading the first n rows from a parquet file?

I have a parquet file and I want to read only the first n rows of it into a pandas DataFrame. What I tried:

df = pd.read_parquet(path= 'filepath', nrows = 10)

It did not work and raised this error:

TypeError: read_table() got an unexpected keyword argument 'nrows'

I also tried the skiprows argument, but that gave me the same error.

Alternatively, I could read the complete parquet file and then keep only the first n rows, but that requires reading far more data than I need, which I want to avoid.

Is there any way to achieve this?

Sanchit Kumar asked Dec 31 '18 01:12




3 Answers

After some digging and getting in touch with the pandas dev team, the conclusion is that pandas does not support the nrows or skiprows arguments when reading a parquet file.

The reason is that pandas uses the pyarrow or fastparquet engine to process parquet files, and pyarrow has no support for reading a file partially or for skipping rows (I am not sure about fastparquet). The issue on the pandas GitHub where this is discussed is linked below.

https://github.com/pandas-dev/pandas/issues/24511

Sanchit Kumar answered Oct 13 '22 03:10



The accepted answer is out of date. It is now possible to read only the first few rows of a parquet file into pandas, though it is a bit messy and depends on the backend.

To do this with PyArrow as the backend:

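from pyarrow.parquet import ParquetFile
import pyarrow as pa

# Open the file; this reads the footer metadata, not the row data.
pf = ParquetFile('file_name.pq')
# Pull a single batch of at most 10 rows from the start of the file.
first_ten_rows = next(pf.iter_batches(batch_size=10))
# Wrap the batch in a Table and convert it to a pandas DataFrame.
df = pa.Table.from_batches([first_ten_rows]).to_pandas()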

Change the batch_size value (10 here) to the number of rows you want to read.

David Kaftan answered Oct 13 '22 02:10



Parquet is a column-oriented storage format and is designed for column-wise access, so it is normal to load the whole file just to access one row.

B. M. answered Oct 13 '22 03:10
