How to read a Parquet file into Pandas DataFrame?

How can I read a modestly sized Parquet dataset into an in-memory Pandas DataFrame without setting up cluster computing infrastructure such as Hadoop or Spark? This is only a moderate amount of data that I would like to read in memory with a simple Python script on a laptop. The data does not reside on HDFS; it is either on the local file system or possibly in S3. I do not want to spin up and configure other services like Hadoop, Hive or Spark.

I thought Blaze/Odo would have made this possible: the Odo documentation mentions Parquet, but the examples all seem to go through an external Hive runtime.

asked Nov 19 '15 by Daniel Mahler

People also ask

How do I convert from parquet to Pandas?

The to_parquet() function is used to write a DataFrame to the binary Parquet format. Its first argument is a file path or a root directory path; a root directory path is used when writing a partitioned dataset.
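A minimal sketch of both modes (the file and directory names here are hypothetical; partition_cols needs pandas 0.24+ with a Parquet engine such as pyarrow installed):

import pandas as pd

df = pd.DataFrame({"year": [2015, 2016], "value": [1.0, 2.0]})

# Write a single file...
df.to_parquet("example.parquet")

# ...or pass a directory plus partition_cols; the path is then treated as
# the root directory of a partitioned dataset (one subdirectory per year).
df.to_parquet("dataset_root", partition_cols=["year"])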

Which function do you use to read a Parquet file into a DataFrame?

pandas provides pd.read_parquet() for this. For Parquet files too large to fit comfortably in memory, you can read them lazily with Dask instead.
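A short sketch of the Dask route (assuming the dask package is installed; the file name is hypothetical):

import dask.dataframe as dd

# Lazily open the file (or a whole directory of Parquet files)
# without loading everything into memory at once.
ddf = dd.read_parquet("example.parquet")

# .compute() materializes the result as an ordinary pandas DataFrame.
df = ddf.compute()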

Does Pandas support parquet?

Yes. We can alter a standard Pandas-based data processing pipeline that reads data from CSV files into one that reads Parquet files instead: the files are converted internally to Pandas DataFrames, all the analytics tasks work unchanged, and the pipeline is still faster most of the time.
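To illustrate how little the pipeline changes (the file and column names here are hypothetical):

import pandas as pd

# Before: df = pd.read_csv("data.csv")
# After: the same step reading Parquet; downstream code is unchanged
# because both calls return a pandas DataFrame.
df = pd.read_parquet("data.parquet")
result = df.groupby("key")["value"].mean()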


2 Answers

pandas 0.21 introduces new functions for Parquet:

pd.read_parquet('example_pa.parquet', engine='pyarrow') 

or

pd.read_parquet('example_fp.parquet', engine='fastparquet') 

The pandas documentation explains:

These engines are very similar and should read/write nearly identical parquet format files. These libraries differ by having different underlying dependencies (fastparquet by using numba, while pyarrow uses a C library).
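For completeness, a minimal round trip showing the engines are interchangeable (assuming both optional dependencies are installed, e.g. pip install pyarrow fastparquet; the file name is hypothetical):

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

# Write with one engine...
df.to_parquet("example.parquet", engine="pyarrow")

# ...and read the same file back with either engine.
df_pa = pd.read_parquet("example.parquet", engine="pyarrow")
df_fp = pd.read_parquet("example.parquet", engine="fastparquet")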

answered Sep 25 '22 by chrisaycock


Update: since the time I answered this, there has been a lot of work in this area. Look at Apache Arrow for better reading and writing of Parquet. Also: http://wesmckinney.com/blog/python-parquet-multithreading/
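A short sketch of the Arrow route (assuming pyarrow is installed; the file name is hypothetical):

import pyarrow.parquet as pq

# Read the file into an Arrow Table, then convert it to pandas.
table = pq.read_table("example.parquet")
df = table.to_pandas()

# Reading only the columns you need keeps memory use down.
df_subset = pq.read_table("example.parquet", columns=["a"]).to_pandas()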

There is a Python Parquet reader that works relatively well: https://github.com/jcrobak/parquet-python

It creates Python objects which you then have to move into a Pandas DataFrame, so the process will be slower than pd.read_csv, for example.
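A sketch of that two-step conversion, assuming the DictReader interface shown in the project's README (the file name is hypothetical):

import parquet  # the parquet-python package
import pandas as pd

# DictReader yields each row as a plain Python dict; collecting the rows
# and building a DataFrame afterwards is the extra step that makes this
# slower than pd.read_csv.
with open("example.parquet", "rb") as fo:
    rows = list(parquet.DictReader(fo))

df = pd.DataFrame(rows)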

answered Sep 24 '22 by danielfrg