Get schema of parquet file in Python

Tags: python, parquet

Is there any Python library that can be used to just get the schema of a Parquet file?

Currently we load the Parquet file into a dataframe in Spark and read the schema from that dataframe to display it in the application's UI. But initializing a Spark context, loading the dataframe, and extracting the schema from it is a time-consuming activity, so I'm looking for an alternative way to get just the schema.

asked Jan 10 '17 by Saran



5 Answers

As other commenters have mentioned, PyArrow is the easiest way to grab the schema of a Parquet file with Python. My answer goes into more detail about the schema that's returned by PyArrow and the metadata that's stored in Parquet files.

import pyarrow.parquet as pq

table = pq.read_table(path)
table.schema # returns the schema

Here's how to create a PyArrow schema (this is the object that's returned by table.schema):

import pyarrow as pa

pa.schema([
    pa.field("id", pa.int64(), True),
    pa.field("last_name", pa.string(), True),
    pa.field("position", pa.string(), True)])

Each PyArrow Field has name, type, nullable, and metadata properties. See here for more details on how to write custom file / column metadata to Parquet files with PyArrow.

The type property is for PyArrow DataType objects. pa.int64() and pa.string() are examples of PyArrow DataTypes.
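If it helps, here's a rough sketch of reading those properties back off a schema like the one above (assuming a reasonably recent PyArrow, where Schema.field accepts a column name):

import pyarrow as pa

schema = pa.schema([
    pa.field("id", pa.int64(), True),
    pa.field("last_name", pa.string(), True),
    pa.field("position", pa.string(), True)])

# Look up a single field by name and inspect its properties.
field = schema.field("id")
print(field.name)      # id
print(field.type)      # int64
print(field.nullable)  # True
print(field.metadata)  # None unless custom metadata was attached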

Make sure you understand column-level metadata like min / max. That'll help you understand some of the cool features, like predicate pushdown filtering, that Parquet files allow for in big data systems.
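To actually peek at those column-level statistics, here's a minimal sketch using PyArrow's ParquetFile (the file name is a placeholder, and statistics are only present if the writer recorded them):

import pyarrow.parquet as pq

pf = pq.ParquetFile("your_file.parquet")  # placeholder path
meta = pf.metadata

# Parquet keeps statistics per row group and per column chunk.
for rg in range(meta.num_row_groups):
    for col in range(meta.num_columns):
        chunk = meta.row_group(rg).column(col)
        stats = chunk.statistics
        if stats is not None:
            print(chunk.path_in_schema, stats.min, stats.max, stats.null_count)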

answered Oct 10 '22 by Powers


This function returns the schema of a local URI representing a parquet file. The schema is returned as a usable Pandas dataframe. The function does not read the whole file, just the schema.

import pandas as pd
import pyarrow.parquet


def read_parquet_schema_df(uri: str) -> pd.DataFrame:
    """Return a Pandas dataframe corresponding to the schema of a local URI of a parquet file.

    The returned dataframe has the columns: column, pa_dtype
    """
    # Ref: https://stackoverflow.com/a/64288036/
    schema = pyarrow.parquet.read_schema(uri, memory_map=True)
    schema = pd.DataFrame(({"column": name, "pa_dtype": str(pa_dtype)} for name, pa_dtype in zip(schema.names, schema.types)))
    schema = schema.reindex(columns=["column", "pa_dtype"], fill_value=pd.NA)  # Ensures columns in case the parquet file has an empty dataframe.
    return schema

It was tested with the following versions of the third-party packages used:

$ pip list | egrep 'pandas|pyarrow'
pandas             1.1.3
pyarrow            1.0.1
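A call would look roughly like this (the path is just an example, and the output depends on your file):

df = read_parquet_schema_df("/tmp/example.parquet")
print(df)
#       column pa_dtype
# 0         id    int64
# 1  last_name   string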
answered Oct 17 '22 by Asclepius


This is supported by using pyarrow (https://github.com/apache/arrow/).

from pyarrow.parquet import ParquetFile
# Source is either the filename or an Arrow file handle (which could be on HDFS)
ParquetFile(source).metadata

Note: We merged the code for this only yesterday, so you need to build it from source, see https://github.com/apache/arrow/commit/f44b6a3b91a15461804dd7877840a557caa52e4e
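(For anyone reading this later: recent PyArrow releases ship this out of the box, so no source build is needed, and the schema is exposed directly on the same object. A sketch, with source being a path or file handle as above:)

from pyarrow.parquet import ParquetFile

pf = ParquetFile(source)
print(pf.metadata)      # file-level metadata: row groups, number of rows, created_by, ...
print(pf.schema)        # the Parquet schema
print(pf.schema_arrow)  # the same schema as a PyArrow Schema object (newer PyArrow versions)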

answered Oct 17 '22 by Uwe L. Korn


In addition to the answer by @mehdio, in case your Parquet data is a directory (e.g. a Parquet dataset generated by Spark), here's how to read the schema / column names:

import pyarrow.parquet as pq
pfile = pq.read_table("file.parquet")
print("Column names: {}".format(pfile.column_names))
print("Schema: {}".format(pfile.schema))
answered Oct 17 '22 by Galuoises


There's now an easier way with the read_schema method. Note that the schema metadata it returns is actually a dict in which your schema is stored as a bytes literal, so you need an extra step to convert it into a proper Python dict.

from pyarrow.parquet import read_schema
import json

schema = read_schema(source)
schema_dict = json.loads(schema.metadata[b'org.apache.spark.sql.parquet.row.metadata'])['fields']
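Note that this relies on the file having been written by Spark, since the org.apache.spark.sql.parquet.row.metadata key is Spark-specific. Once decoded, each entry in the list is a plain dict, so you can do something like:

# Each field in Spark's StructType JSON has a name, type, nullable flag and metadata.
for field in schema_dict:
    print(field["name"], field["type"], field["nullable"])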
answered Oct 17 '22 by mehdio