 

A comparison between fastparquet and pyarrow?

After some searching I failed to find a thorough comparison of fastparquet and pyarrow.

I found this blog post (a basic comparison of speeds), and a GitHub discussion claiming that files created with fastparquet are not supported by AWS Athena (by the way, is that still the case?).

When/why would I use one over the other? What are the major advantages and disadvantages?


My specific use case is processing data with Dask, writing it to S3, and then reading/analyzing it with AWS Athena.
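Roughly what that looks like (a minimal sketch; the bucket paths and columns are placeholders, and s3fs is assumed to be installed so Dask can talk to S3):

import dask.dataframe as dd

# Read raw data, then write partitioned Parquet to S3.
df = dd.read_csv('s3://my-bucket/raw/*.csv')
df.to_parquet(
    's3://my-bucket/curated/events/',
    engine='pyarrow',        # or 'fastparquet'
    compression='snappy',
    write_index=False,
)
# The resulting directory is then registered as an external table in Athena.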

asked Jul 16 '18 by moshevi


People also ask

Which is better, pyarrow or fastparquet?

According to it, pyarrow is faster than fastparquet; little wonder it is the default engine used in Dask.

What is Fastparquet?

fastparquet is a Python implementation of the Parquet format, aiming to integrate into Python-based big data workflows. It is used implicitly by projects such as Dask, pandas and intake-parquet.

What is Pyarrow?

pyarrow is the Python API of Apache Arrow. Apache Arrow is a development platform for in-memory analytics; it contains a set of technologies that enable big data systems to store, process and move data fast.
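As a rough illustration (the file and column names here are made up), pyarrow exposes Parquet reading and writing through its pyarrow.parquet module:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Convert a pandas DataFrame to an Arrow table, write it as Parquet, and read it back.
df = pd.DataFrame({'id': [1, 2, 3], 'value': ['a', 'b', 'c']})
table = pa.Table.from_pandas(df)
pq.write_table(table, 'example.parquet')

round_tripped = pq.read_table('example.parquet').to_pandas()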

Why should you use Parquet files with pandas?

With its column-oriented design, Parquet brings many efficient storage characteristics (e.g., blocks, row groups, column chunks) into the fold. Additionally, it is built to support very efficient compression and encoding schemes, making space-saving data pipelines practical.
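A small sketch of what that means in practice (file and column names are illustrative): you can compress on write and read back only the columns you actually need.

import pandas as pd

df = pd.DataFrame({'user': ['a', 'b'], 'amount': [10.0, 20.0], 'note': ['x', 'y']})

# Columnar storage with compression on write...
df.to_parquet('sales.parquet', compression='snappy')

# ...and selective column reads, so unused columns are never loaded.
subset = pd.read_parquet('sales.parquet', columns=['user', 'amount'])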


2 Answers

I used both fastparquet and pyarrow for converting protobuf data to Parquet and for querying it in S3 with Athena. Both worked; however, my use case is a Lambda function, where the package zip file has to be lightweight, so I went with fastparquet. (The fastparquet library was only about 1.1 MB, while the pyarrow library was 176 MB, and the Lambda package limit is 250 MB.)

I used the following to store a DataFrame as a Parquet file:

from os import path
from fastparquet import write

# df_data is the pandas DataFrame to be stored; filename is its base name.
parquet_file = path.join(filename + '.parq')
write(parquet_file, df_data)
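To sanity-check the output before pointing Athena at it (purely illustrative), the same library can read the file back:

from fastparquet import ParquetFile

# Re-open the file just written and materialise it as a pandas DataFrame.
pf = ParquetFile(parquet_file)
df_check = pf.to_pandas()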
answered Sep 16 '22 by Daenerys


However, since the question lacks concrete criteria, and since I came here looking for a good "default choice", I want to point out that pandas' default Parquet engine for DataFrame objects is pyarrow (see the pandas docs).
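For example (the file names are just illustrative), pandas picks pyarrow automatically when it is installed, and either engine can still be requested explicitly:

import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3]})

# engine='auto' (the default) prefers pyarrow and falls back to fastparquet.
df.to_parquet('data.parquet')

# Forcing a specific engine:
df.to_parquet('data_pyarrow.parquet', engine='pyarrow')
df.to_parquet('data_fastparquet.parquet', engine='fastparquet')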

answered Sep 19 '22 by d4tm4x