Pandas: Reading the first n rows from a parquet file?

Tags: python, pandas, parquet

I have a parquet file and I want to read the first n rows from it into a pandas DataFrame. What I tried:

df = pd.read_parquet(path= 'filepath', nrows = 10)

It did not work and raised this error:

TypeError: read_table() got an unexpected keyword argument 'nrows'

I tried the skiprows argument as well, but that gave me the same error.

Alternatively, I can read the complete parquet file and then keep only the first n rows, but that requires more computation, which I want to avoid.

Is there any way to achieve this?

asked Dec 31 '18 01:12

Sanchit Kumar

3 Answers

After some digging and getting in touch with the pandas dev team, the conclusion is that pandas does not support the nrows or skiprows arguments when reading a parquet file.

The reason is that pandas uses the pyarrow or fastparquet engines to process parquet files, and pyarrow (at the time of writing) had no support for reading a file partially or for skipping rows (not sure about fastparquet). The discussion is in this issue on the pandas GitHub:

https://github.com/pandas-dev/pandas/issues/24511

answered Oct 13 '22 03:10

Sanchit Kumar

The accepted answer is out of date. It is now possible to read only the first few rows of a parquet file into pandas, though it is a bit messy and backend dependent.

To read using PyArrow as the backend, do the following:

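import pyarrow as pa
from pyarrow.parquet import ParquetFile

pf = ParquetFile('file_name.pq')
# iter_batches yields record batches lazily, so only the first one is pulled
first_ten_rows = next(pf.iter_batches(batch_size=10))
# Wrap the single batch in a Table to convert it to a pandas DataFrame
df = pa.Table.from_batches([first_ten_rows]).to_pandas()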

Change the batch_size value of 10 to however many rows you want to read in.

answered Oct 13 '22 02:10

David Kaftan

Parquet is a column-oriented storage format, designed for reading whole columns rather than individual rows. So it is normal to load the whole file to access just one row.

answered Oct 13 '22 03:10

B. M.
