 

Writing parquet files from Python without pandas

I need to transform data from JSON to Parquet as part of an ETL pipeline. I'm currently doing it with the from_pandas method of a pyarrow.Table. However, building a dataframe first feels like an unnecessary step, plus I'd like to avoid having pandas as a dependency.

Is there a way to write Parquet files without needing to load the data into a dataframe first?

asked May 04 '18 by Milan Cermak

People also ask

Is Pyarrow faster than pandas?

The pyarrow library can construct a pandas.DataFrame faster than pandas itself can.

How do I convert a CSV file to a parquet file in Python?

Internally, pandas' to_parquet() uses the pyarrow module. You can do the conversion from CSV to Parquet directly in pyarrow using parquet.write_table(). This removes one level of indirection, so it's slightly more efficient.

Why you should use parquet files with pandas?

With its column-oriented design, Parquet brings many efficient storage characteristics (e.g., blocks, row group, column chunks) into the fold. Additionally, it is built to support very efficient compression and encoding schemes for realizing space-saving data pipelines.


1 Answer

At the moment, the most convenient way to build Parquet is via pandas, due to its maturity. Nevertheless, pyarrow also provides facilities to build its tables from normal Python data structures:

import pyarrow as pa
import pyarrow.parquet as pq

# Build an Arrow table directly from plain Python lists
string_array = pa.array(['a', 'b', 'c'])
table = pa.Table.from_arrays([string_array], ['str'])

# Write it out as Parquet without ever touching pandas
pq.write_table(table, 'example.parquet')

As Parquet is a columnar data format, you will still have to load the data into memory once to transform the row-wise representation into a columnar one.

At the moment, you also need to construct each Arrow array in one go; you cannot build them up incrementally. In the future, we plan to expose the (incremental) builder classes from C++: https://github.com/apache/arrow/pull/1930

answered Sep 28 '22 by Uwe L. Korn