
Type hinting Pandas DataFrame content and columns

I am writing a function that returns a Pandas DataFrame object. I would like a type hint that specifies which columns this DataFrame contains, rather than only describing them in the docstring, so that the data is easier for the end user to work with.

Is there a way to type hint DataFrame content like this? Ideally, this would integrate well with tools like Visual Studio Code and PyCharm when editing Python files and Jupyter Notebooks.

An example function:

def generate_data(bunch, of, inputs) -> pd.DataFrame:
    """Massages the input to a nice and easy DataFrame.
    
    :return:
         DataFrame with columns a(int), b(float), c(string), d(us dollars as float)
    """
Asked by Mikko Ohtamaa, Dec 28 '25 12:12


1 Answer

As of April 2023, the most capable project for strongly typing pandas DataFrames is pandera. Unfortunately, what it offers is still quite limited and falls short of what one might want.

Here is an example of how you can use pandera in your case:

import pandas as pd
import pandera as pa
from pandera.typing import DataFrame

class MySchema(pa.DataFrameModel):
    a: int
    b: float
    c: str = pa.Field(nullable=True)  # For example, allow None values
    d: float    # US dollars

class OtherSchema(pa.DataFrameModel):
    year: int = pa.Field(ge=1900, le=2050)


def generate_data() -> DataFrame[MySchema]:
    df = pd.DataFrame({
        "a": [1, 2, 3],
        "b": [10.0, 20.0, 30.0],
        "c": ["A", "B", "C"],
        "d": [0.1, 0.2, 0.3],
    })

    # Runtime verification here, throws on schema mismatch
    strongly_typed_df = DataFrame[MySchema](df)
    return strongly_typed_df

def transform(input: DataFrame[MySchema]) -> DataFrame[OtherSchema]:
    # This demonstrates that you can use strongly
    # typed column names from the schema
    df = input.filter(items=[MySchema.a]).rename(
            columns={MySchema.a: OtherSchema.year}
    )

    return DataFrame[OtherSchema](df) # This will throw on range validation!


df1 = generate_data()
df2 = transform(df1)
transform(df2)   # mypy prints error here - incompatible type!

mypy reports a static type error on the last line, because df2 is a DataFrame[OtherSchema] while transform expects a DataFrame[MySchema].

Discussion of advantages and limitations

With pandera we get –

  1. Clear and readable (dataclass-style) DataFrame schema definitions, and the ability to use them as type hints.
  2. Run-time schema verification. A schema can define more constraints than just types (see the year field in the example above, and the pandera docs for more).
  3. Experimental support for static type checking by mypy.

What we still miss –

  1. Full static type checking for column level verification.
  2. Any IDE support for column name auto-completion.
  3. Inline syntax for schema declaration; each schema has to be defined explicitly as a separate class before it can be used.

More examples

Pandera docs - https://pandera.readthedocs.io/en/stable/dataframe_models.html

Similar question - Type hints for a pandas DataFrame with mixed dtypes

Other typing projects

pandas-stubs is an active project providing type declarations for the pandas public API that are richer than the type stubs included in pandas itself. However, it does not provide any facilities for column-level schemas.

There are also quite a few outdated libraries related to this and to pandas typing in general: dataenforce, data-science-types, python-type-stubs.

pandera provides two different APIs that seem to be equally powerful: an object-based API and a class-based API. I demonstrate the latter here.

Answered by vvv444, Dec 31 '25 00:12


