Get schema of parquet file in Python

Tags: python, parquet

Is there any Python library that can be used to just get the schema of a Parquet file?

Currently we load the Parquet file into a dataframe in Spark and read the schema from that dataframe to display it in the application's UI. But initializing a Spark context, loading the dataframe, and extracting the schema from it is a time-consuming activity, so I'm looking for an alternative way to get just the schema.

asked Jan 10 '17 by Saran



5 Answers

As other commenters have mentioned, PyArrow is the easiest way to grab the schema of a Parquet file with Python. My answer goes into more detail about the schema that's returned by PyArrow and the metadata that's stored in Parquet files.

import pyarrow.parquet as pq

table = pq.read_table(path)
table.schema # returns the schema

Here's how to create a PyArrow schema (this is the object that's returned by table.schema):

import pyarrow as pa

pa.schema([
    pa.field("id", pa.int64(), True),
    pa.field("last_name", pa.string(), True),
    pa.field("position", pa.string(), True)])

Each PyArrow Field has name, type, nullable, and metadata properties. See here for more details on how to write custom file / column metadata to Parquet files with PyArrow.

The type property is for PyArrow DataType objects. pa.int64() and pa.string() are examples of PyArrow DataTypes.
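If it helps, here's a rough sketch of reading those properties back off a schema like the one above (assuming a reasonably recent PyArrow, where Schema.field accepts a column name):

import pyarrow as pa

schema = pa.schema([
    pa.field("id", pa.int64(), True),
    pa.field("last_name", pa.string(), True),
    pa.field("position", pa.string(), True)])

# Look up a single field by name and inspect its properties.
field = schema.field("id")
print(field.name)      # id
print(field.type)      # int64
print(field.nullable)  # True
print(field.metadata)  # None unless custom metadata was attached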

Make sure you understand column-level metadata like min / max. That'll help you understand some of the cool features, like predicate pushdown filtering, that Parquet files allow for in big data systems.
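To actually peek at those column-level statistics, here's a minimal sketch using PyArrow's ParquetFile (the file name is a placeholder, and statistics are only present if the writer recorded them):

import pyarrow.parquet as pq

pf = pq.ParquetFile("your_file.parquet")  # placeholder path
meta = pf.metadata

# Parquet keeps statistics per row group and per column chunk.
for rg in range(meta.num_row_groups):
    for col in range(meta.num_columns):
        chunk = meta.row_group(rg).column(col)
        stats = chunk.statistics
        if stats is not None:
            print(chunk.path_in_schema, stats.min, stats.max, stats.null_count)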

answered Oct 10 '22 by Powers


This function returns the schema of a local URI representing a parquet file. The schema is returned as a usable Pandas dataframe. The function does not read the whole file, just the schema.

import pandas as pd
import pyarrow.parquet


def read_parquet_schema_df(uri: str) -> pd.DataFrame:
    """Return a Pandas dataframe corresponding to the schema of a local URI of a parquet file.

    The returned dataframe has the columns: column, pa_dtype
    """
    # Ref: https://stackoverflow.com/a/64288036/
    schema = pyarrow.parquet.read_schema(uri, memory_map=True)
    schema = pd.DataFrame(({"column": name, "pa_dtype": str(pa_dtype)} for name, pa_dtype in zip(schema.names, schema.types)))
    schema = schema.reindex(columns=["column", "pa_dtype"], fill_value=pd.NA)  # Ensures columns in case the parquet file has an empty dataframe.
    return schema

It was tested with the following versions of the third-party packages used:

$ pip list | egrep 'pandas|pyarrow'
pandas             1.1.3
pyarrow            1.0.1
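A call would look roughly like this (the path is just an example, and the output depends on your file):

df = read_parquet_schema_df("/tmp/example.parquet")
print(df)
#       column pa_dtype
# 0         id    int64
# 1  last_name   string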
answered Oct 17 '22 by Asclepius


This is supported by using pyarrow (https://github.com/apache/arrow/).

from pyarrow.parquet import ParquetFile
# Source is either the filename or an Arrow file handle (which could be on HDFS)
ParquetFile(source).metadata

Note: We merged the code for this only yesterday, so you need to build it from source, see https://github.com/apache/arrow/commit/f44b6a3b91a15461804dd7877840a557caa52e4e
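(For anyone reading this later: recent PyArrow releases ship this out of the box, so no source build is needed, and the schema is exposed directly on the same object. A sketch, with source being a path or file handle as above:)

from pyarrow.parquet import ParquetFile

pf = ParquetFile(source)
print(pf.metadata)      # file-level metadata: row groups, number of rows, created_by, ...
print(pf.schema)        # the Parquet schema
print(pf.schema_arrow)  # the same schema as a PyArrow Schema object (newer PyArrow versions)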

answered Oct 17 '22 by Uwe L. Korn


In addition to the answer by @mehdio, in case your Parquet data is a directory (e.g. a Parquet dataset generated by Spark), here's how to read the schema / column names:

import pyarrow.parquet as pq
pfile = pq.read_table("file.parquet")
print("Column names: {}".format(pfile.column_names))
print("Schema: {}".format(pfile.schema))
answered Oct 17 '22 by Galuoises


There's now an easier way with the read_schema method. Note that the schema metadata it returns is actually a dict in which your schema is stored as a bytes literal, so you need an extra step to convert it into a proper Python dict.

from pyarrow.parquet import read_schema
import json

schema = read_schema(source)
schema_dict = json.loads(schema.metadata[b'org.apache.spark.sql.parquet.row.metadata'])['fields']
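Note that this relies on the file having been written by Spark, since the org.apache.spark.sql.parquet.row.metadata key is Spark-specific. Once decoded, each entry in the list is a plain dict, so you can do something like:

# Each field in Spark's StructType JSON has a name, type, nullable flag and metadata.
for field in schema_dict:
    print(field["name"], field["type"], field["nullable"])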
answered Oct 17 '22 by mehdio