I need to transform data from JSON to Parquet as part of an ETL pipeline. I'm currently doing it with the from_pandas
method of a pyarrow.Table. However, building a DataFrame first feels like an unnecessary step, and I'd also like to avoid having pandas as a dependency.
Is there a way to write Parquet files without loading the data into a DataFrame first?
The pyarrow library can construct a pandas.DataFrame faster than pandas itself can.
Internally, pandas' to_parquet() uses the pyarrow module. You can do the conversion from CSV to Parquet directly in pyarrow using parquet.write_table(). This removes one level of indirection, so it's slightly more efficient.
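For example, a rough sketch of that direct CSV-to-Parquet conversion in pyarrow (the file names here are placeholders):

import pyarrow.csv as pacsv
import pyarrow.parquet as pq

# Read the CSV into an Arrow Table, then write that Table straight to Parquet
table = pacsv.read_csv('input.csv')
pq.write_table(table, 'output.parquet')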
With its column-oriented design, Parquet brings many efficient storage characteristics (e.g., blocks, row groups, column chunks) into the fold. Additionally, it supports very efficient compression and encoding schemes, which keeps data pipelines compact on disk.
At the moment, the most convenient way to build Parquet files is with pandas, due to its maturity. Nevertheless, pyarrow
also provides facilities to build its tables from normal Python data structures:
import pyarrow as pa
string_array = pa.array(['a', 'b', 'c'])
pa.Table.from_arrays([string_array], ['str'])
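Writing such a table out is then a single pyarrow.parquet.write_table() call. As a minimal sketch (the output file name is a placeholder):

import pyarrow as pa
import pyarrow.parquet as pq

# Build the table as above and write it to a Parquet file, with no pandas involved
string_array = pa.array(['a', 'b', 'c'])
table = pa.Table.from_arrays([string_array], ['str'])
pq.write_table(table, 'strings.parquet')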
As Parquet is a columnar data format, you will have to load the data into memory once to transform it from a row-wise to a columnar representation.
At the moment, you also need to construct each Arrow array in one go; you cannot build them up incrementally. In the future, we plan to expose the (incremental) builder classes from C++: https://github.com/apache/arrow/pull/1930
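To sketch what that row-wise-to-columnar step can look like for JSON records, assuming every record has the same keys (the input file name and the newline-delimited layout are illustrative assumptions):

import json
import pyarrow as pa
import pyarrow.parquet as pq

# Load the row-wise JSON records into memory
with open('records.jsonl') as f:
    rows = [json.loads(line) for line in f]

# Transpose rows into per-column value lists, then build each Arrow array in one call
columns = {name: [row[name] for row in rows] for name in rows[0]}
table = pa.Table.from_arrays(
    [pa.array(values) for values in columns.values()],
    list(columns.keys()),
)
pq.write_table(table, 'records.parquet')

Newer pyarrow releases also ship a pyarrow.json.read_json() reader for newline-delimited JSON, which can replace the manual transposition above.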