How to read a Parquet file into Pandas DataFrame?

How can I read a modestly sized Parquet dataset into an in-memory Pandas DataFrame without setting up cluster computing infrastructure such as Hadoop or Spark? This is only a moderate amount of data that I would like to read in memory with a simple Python script on a laptop. The data does not reside on HDFS; it is either on the local file system or possibly in S3. I do not want to spin up and configure other services like Hadoop, Hive or Spark.

I thought Blaze/Odo would have made this possible: the Odo documentation mentions Parquet, but the examples all seem to go through an external Hive runtime.

asked Nov 19 '15 by Daniel Mahler

People also ask

How do I convert from parquet to Pandas?

The to_parquet() function is used to write a DataFrame to the binary Parquet format. Its first argument is a file path or a root directory path; a root directory path is used when writing a partitioned dataset.
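A minimal sketch of both modes (the file and directory names here are hypothetical; partition_cols needs pandas 0.24+ with a Parquet engine such as pyarrow installed):

import pandas as pd

df = pd.DataFrame({"year": [2015, 2016], "value": [1.0, 2.0]})

# Write a single file...
df.to_parquet("example.parquet")

# ...or pass a directory plus partition_cols; the path is then treated as
# the root directory of a partitioned dataset (one subdirectory per year).
df.to_parquet("dataset_root", partition_cols=["year"])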

Which function do you use to read a Parquet file into a DataFrame?

pandas provides pd.read_parquet() for this. For Parquet files too large to fit comfortably in memory, you can read them lazily with Dask instead.
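A short sketch of the Dask route (assuming the dask package is installed; the file name is hypothetical):

import dask.dataframe as dd

# Lazily open the file (or a whole directory of Parquet files)
# without loading everything into memory at once.
ddf = dd.read_parquet("example.parquet")

# .compute() materializes the result as an ordinary pandas DataFrame.
df = ddf.compute()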

Does Pandas support parquet?

Yes. We can alter a standard Pandas-based data processing pipeline that reads data from CSV files into one that reads Parquet files instead: the files are converted internally to Pandas DataFrames, all the analytics tasks work unchanged, and the pipeline is still faster most of the time.
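To illustrate how little the pipeline changes (the file and column names here are hypothetical):

import pandas as pd

# Before: df = pd.read_csv("data.csv")
# After: the same step reading Parquet; downstream code is unchanged
# because both calls return a pandas DataFrame.
df = pd.read_parquet("data.parquet")
result = df.groupby("key")["value"].mean()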


2 Answers

pandas 0.21 introduces new functions for Parquet:

pd.read_parquet('example_pa.parquet', engine='pyarrow') 

or

pd.read_parquet('example_fp.parquet', engine='fastparquet') 

The pandas documentation explains:

These engines are very similar and should read/write nearly identical parquet format files. These libraries differ by having different underlying dependencies (fastparquet by using numba, while pyarrow uses a C library).
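For completeness, a minimal round trip showing the engines are interchangeable (assuming both optional dependencies are installed, e.g. pip install pyarrow fastparquet; the file name is hypothetical):

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

# Write with one engine...
df.to_parquet("example.parquet", engine="pyarrow")

# ...and read the same file back with either engine.
df_pa = pd.read_parquet("example.parquet", engine="pyarrow")
df_fp = pd.read_parquet("example.parquet", engine="fastparquet")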

answered Sep 25 '22 by chrisaycock


Update: since the time I answered this, there has been a lot of work in this area. Look at Apache Arrow for better reading and writing of Parquet. Also: http://wesmckinney.com/blog/python-parquet-multithreading/
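A short sketch of the Arrow route (assuming pyarrow is installed; the file name is hypothetical):

import pyarrow.parquet as pq

# Read the file into an Arrow Table, then convert it to pandas.
table = pq.read_table("example.parquet")
df = table.to_pandas()

# Reading only the columns you need keeps memory use down.
df_subset = pq.read_table("example.parquet", columns=["a"]).to_pandas()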

There is a Python Parquet reader that works relatively well: https://github.com/jcrobak/parquet-python

It creates Python objects which you then have to move into a Pandas DataFrame, so the process will be slower than pd.read_csv, for example.
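A sketch of that two-step conversion, assuming the DictReader interface shown in the project's README (the file name is hypothetical):

import parquet  # the parquet-python package
import pandas as pd

# DictReader yields each row as a plain Python dict; collecting the rows
# and building a DataFrame afterwards is the extra step that makes this
# slower than pd.read_csv.
with open("example.parquet", "rb") as fo:
    rows = list(parquet.DictReader(fo))

df = pd.DataFrame(rows)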

answered Sep 24 '22 by danielfrg