I have a file that has one JSON object per line. Here is a sample:
{
  "product": {
    "id": "abcdef",
    "price": 19.99,
    "specs": {
      "voltage": "110v",
      "color": "white"
    }
  },
  "user": "Daniel Severo"
}
I want to create a parquet file with columns such as:
product.id, product.price, product.specs.voltage, product.specs.color, user
I know that Parquet has a nested encoding using the Dremel algorithm, but I haven't been able to use it from Python (I'm not sure why).
I'm a heavy pandas and dask user, so the pipeline I'm trying to construct is JSON data -> dask -> parquet -> pandas, although if anyone has a simple example of creating and reading these nested encodings in Parquet using Python, I think that would be good enough :D
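Until nested types round-trip cleanly, one workaround that fits this same json -> dask -> parquet -> pandas pipeline is to flatten each record into dotted column names before writing. Below is a minimal sketch, assuming the input lives in data.jsonl (an illustrative name), that dask and pyarrow (or fastparquet) are installed, and that the flatten helper is our own, not part of any library:

import json
import dask.bag as db
import pandas as pd

def flatten(record, prefix=""):
    # Recursively turn {"product": {"id": ...}} into {"product.id": ...}.
    out = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            out.update(flatten(value, name + "."))
        else:
            out[name] = value
    return out

records = db.read_text("data.jsonl").map(json.loads).map(flatten)
ddf = records.to_dataframe()
# Columns: product.id, product.price, product.specs.voltage,
#          product.specs.color, user
ddf.to_parquet("flat_parquet", engine="pyarrow", write_index=False)

# Reading back with pandas gives the same flat columns.
df = pd.read_parquet("flat_parquet")

The trade-off is that the nesting itself does not survive: the output is an ordinary flat table whose column names merely record where each value came from.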
EDIT
So, after digging through the PRs, I found this: https://github.com/dask/fastparquet/pull/177 which is basically what I want to do. However, I still can't make it work all the way through. How exactly do I tell dask/fastparquet that my product column is nested?
Parquet stores nested data structures in a flat columnar format using a technique outlined in the Dremel paper from Google.
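As a small illustration of that encoding from Python, here is a hedged sketch with pyarrow (assuming a reasonably recent version; the file name is made up): a column of Python dicts is inferred as a struct column, and Parquet shreds it into flat columns internally when the table is written.

import pyarrow as pa
import pyarrow.parquet as pq

# One row shaped like the sample record; "product" is inferred as a struct column.
table = pa.table({
    "product": [{
        "id": "abcdef",
        "price": 19.99,
        "specs": {"voltage": "110v", "color": "white"},
    }],
    "user": ["Daniel Severo"],
})

pq.write_table(table, "nested.parquet")
print(pq.read_table("nested.parquet").schema)
# product: struct<id, price, specs: struct<voltage, color>>
# user: string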
Apache Parquet is a popular columnar storage file format used by Hadoop-ecosystem systems such as Pig, Spark, and Hive. The format is language independent and has a binary representation; it is used to efficiently store large data sets and carries the extension .parquet.
Storing JSON values in a map column in Parquet avoids re-parsing the JSON in every query, but it still requires reading the data for all map values, not just for the selected keys.
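For context, a map column in this sense looks roughly like the sketch below (a hedged example with pyarrow, assuming a version with MAP logical-type support; file and column names are illustrative):

import pyarrow as pa
import pyarrow.parquet as pq

# Keep the free-form "specs" object as MAP<string, string> instead of raw JSON text.
specs = pa.array(
    [[("voltage", "110v"), ("color", "white")]],
    type=pa.map_(pa.string(), pa.string()),
)
table = pa.table({"id": ["abcdef"], "specs": specs})
pq.write_table(table, "specs_map.parquet")

# Selecting the column still materialises every key/value pair in the map;
# individual keys cannot be pruned the way struct fields can.
print(pq.read_table("specs_map.parquet", columns=["specs"]).to_pydict())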
Implementing the conversions on both the read and write path for arbitrary Parquet nested data is quite complicated to get right: it means implementing the record shredding and reassembly algorithm along with the associated conversions to Python data structures. We have this on the roadmap in Arrow / parquet-cpp (see https://github.com/apache/parquet-cpp/tree/master/src/parquet/arrow), but it has not been completed yet (only simple structs and lists/arrays are supported right now). It is important to have this functionality because other systems that use Parquet, like Impala, Hive, Presto, Drill, and Spark, have native support for nested types in their SQL dialects, so we need to be able to read and write these structures faithfully from Python.
This can be analogously implemented in fastparquet as well, but it's going to be a lot of work (and test cases to write) no matter how you slice it.
I will likely take on the work (in parquet-cpp) personally later this year if no one beats me to it, but I would love to have some help.
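Since simple structs can already be read, one way to get the dotted columns from the question today is pyarrow's Table.flatten, sketched below against the nested.parquet file written in the earlier example (again an assumed file name):

import pyarrow.parquet as pq

table = pq.read_table("nested.parquet")
flat = table.flatten()    # product.id, product.price, product.specs, user
flat = flat.flatten()     # a second pass expands product.specs.* as well
df = flat.to_pandas()
print(df.columns.tolist())
# ['product.id', 'product.price', 'product.specs.voltage', 'product.specs.color', 'user']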