Nested data in Parquet with Python

I have a file with one JSON object per line. Here is a sample record (pretty-printed for readability):

{
    "product": {
        "id": "abcdef",
        "price": 19.99,
        "specs": {
            "voltage": "110v",
            "color": "white"
        }
    },
    "user": "Daniel Severo"
}

I want to create a parquet file with columns such as:

product.id, product.price, product.specs.voltage, product.specs.color, user
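One way to get those column names without any nested Parquet support is to flatten each record before it reaches dask or pandas. The following is a minimal sketch using only the standard library; the helper name `flatten` and the `sep` parameter are illustrative assumptions, not part of dask or fastparquet.

```python
import json


def flatten(record, parent_key="", sep="."):
    """Flatten nested dicts into one dict whose keys are dotted paths.

    Hypothetical helper for illustration; it only descends into dicts
    and leaves lists and scalar values untouched.
    """
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested objects, carrying the path so far.
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


# One line of the input file (the sample record from the question).
line = (
    '{"product": {"id": "abcdef", "price": 19.99, '
    '"specs": {"voltage": "110v", "color": "white"}}, '
    '"user": "Daniel Severo"}'
)

row = flatten(json.loads(line))
print(list(row))
```

A list of such flat dicts can be passed to `pandas.DataFrame(rows)`, and in recent pandas versions the resulting frame can be written with `DataFrame.to_parquet`. Note that this stores ordinary flat columns whose names happen to contain dots; it does not use Parquet's nested encoding.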

I know that Parquet has a nested encoding based on the Dremel algorithm, but I haven't been able to use it from Python (not sure why).

I'm a heavy pandas and dask user, so the pipeline I'm trying to build is JSON data -> dask -> parquet -> pandas. That said, a simple example of writing and reading these nested encodings in Parquet from Python would be good enough :D

EDIT

After digging through the PRs, I found this: https://github.com/dask/fastparquet/pull/177

It is basically what I want to do, but I still can't make it work all the way through. How exactly do I tell dask/fastparquet that my product column is nested?

  • dask version: 0.15.1
  • fastparquet version: 0.1.1
asked Jul 27 '17 04:07 by Daniel Severo


1 Answer

Implementing the conversions on both the read and write paths for arbitrary nested Parquet data is quite complicated to get right: it means implementing the shredding and reassembly algorithm, along with the associated conversions to some Python data structures. We have this on the roadmap in Arrow / parquet-cpp (see https://github.com/apache/parquet-cpp/tree/master/src/parquet/arrow), but it has not been completed yet (only simple structs and lists/arrays are supported now). It is important to have this functionality because other systems that use Parquet, like Impala, Hive, Presto, Drill, and Spark, have native support for nested types in their SQL dialects, so we need to be able to read and write these structures faithfully from Python.
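To give a rough feel for what "shredding" and "reassembly" mean, here is a heavily simplified sketch in plain Python. It is not the Dremel algorithm and not how parquet-cpp or fastparquet work: it assumes every record has exactly the same struct-only schema (no lists, no missing fields), so it needs no repetition or definition levels. The function names `shred` and `reassemble` are hypothetical, and the second record is made-up sample data.

```python
def leaf_paths(record, prefix=()):
    """Yield (path, value) pairs for every non-dict leaf in a nested dict."""
    for key, value in record.items():
        path = prefix + (key,)
        if isinstance(value, dict):
            yield from leaf_paths(value, path)
        else:
            yield path, value


def shred(records):
    """Split records into one list of values per leaf path (one 'column' each)."""
    columns = {}
    for record in records:
        for path, value in leaf_paths(record):
            columns.setdefault(path, []).append(value)
    return columns


def reassemble(columns):
    """Rebuild the nested records from the per-path columns."""
    if not columns:
        return []
    n_rows = len(next(iter(columns.values())))
    records = [{} for _ in range(n_rows)]
    for path, values in columns.items():
        for record, value in zip(records, values):
            node = record
            # Walk (and create) the intermediate structs along the path.
            for key in path[:-1]:
                node = node.setdefault(key, {})
            node[path[-1]] = value
    return records


records = [
    {"product": {"id": "abcdef", "price": 19.99,
                 "specs": {"voltage": "110v", "color": "white"}},
     "user": "Daniel Severo"},
    # Made-up second record, only here to show more than one row.
    {"product": {"id": "ghijkl", "price": 5.0,
                 "specs": {"voltage": "220v", "color": "black"}},
     "user": "Another User"},
]

columns = shred(records)
restored = reassemble(columns)
```

The hard part described above is everything this sketch leaves out: optional fields, repeated fields (lists), and lists nested inside structs, which are what the repetition and definition levels in the Dremel encoding exist to handle.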

The same thing could be implemented in fastparquet, but it's going to be a lot of work (and a lot of test cases to write) no matter how you slice it.

I will likely take on the work (in parquet-cpp) personally later this year if no one beats me to it, but I would love to have some help.

answered Oct 01 '22 18:10 by Wes McKinney