I need to transform data from JSON to Parquet as part of an ETL pipeline. I'm currently doing it with the from_pandas
method of a pyarrow.Table. However, building a DataFrame first feels like an unnecessary step, and I'd also like to avoid having pandas as a dependency.
Is there a way to write Parquet files without loading the data into a DataFrame first?
The pyarrow library can construct a pandas.DataFrame faster than pandas itself can.
Internally, pandas' to_parquet() uses the pyarrow module. You can do the conversion from CSV to Parquet directly in pyarrow using parquet.write_table(). This removes one level of indirection, so it's slightly more efficient.
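For example, a rough sketch of that direct CSV-to-Parquet conversion in pyarrow (the file names here are placeholders):

import pyarrow.csv as pacsv
import pyarrow.parquet as pq

# Read the CSV into an Arrow Table, then write that Table straight to Parquet
table = pacsv.read_csv('input.csv')
pq.write_table(table, 'output.parquet')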
With its column-oriented design, Parquet brings many efficient storage characteristics (e.g., blocks, row groups, column chunks) into the fold. Additionally, it supports very efficient compression and encoding schemes, which keeps data pipelines compact on disk.
At the moment, the most convenient way to build Parquet files is with pandas, due to its maturity. Nevertheless, pyarrow
also provides facilities to build its tables from normal Python data structures:
import pyarrow as pa
string_array = pa.array(['a', 'b', 'c'])
pa.Table.from_arrays([string_array], ['str'])
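Writing such a table out is then a single pyarrow.parquet.write_table() call. As a minimal sketch (the output file name is a placeholder):

import pyarrow as pa
import pyarrow.parquet as pq

# Build the table as above and write it to a Parquet file, with no pandas involved
string_array = pa.array(['a', 'b', 'c'])
table = pa.Table.from_arrays([string_array], ['str'])
pq.write_table(table, 'strings.parquet')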
As Parquet is a columnar data format, you will have to load the data into memory once to transform it from a row-wise to a columnar representation.
At the moment, you also need to construct each Arrow array in one go; you cannot build them up incrementally. In the future, we plan to expose the (incremental) builder classes from C++: https://github.com/apache/arrow/pull/1930
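To sketch what that row-wise-to-columnar step can look like for JSON records, assuming every record has the same keys (the input file name and the newline-delimited layout are illustrative assumptions):

import json
import pyarrow as pa
import pyarrow.parquet as pq

# Load the row-wise JSON records into memory
with open('records.jsonl') as f:
    rows = [json.loads(line) for line in f]

# Transpose rows into per-column value lists, then build each Arrow array in one call
columns = {name: [row[name] for row in rows] for name in rows[0]}
table = pa.Table.from_arrays(
    [pa.array(values) for values in columns.values()],
    list(columns.keys()),
)
pq.write_table(table, 'records.parquet')

Newer pyarrow releases also ship a pyarrow.json.read_json() reader for newline-delimited JSON, which can replace the manual transposition above.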