
Is it possible in polars to give the full schema of a LazyFrame/DataFrame in a function argument, and get type errors?

There are occasions when I know ahead of time the full schema of a table I'm working with. In those scenarios, it would be nice to be able to specify the full schema (call it a FullyDefinedFrame). Then the type system could help me out with things like:

  1. error when accessing a column that doesn't exist
  2. type checking, e.g. you can't add a string column to an int column
  3. have it generate a new schema from an operation on a FullyDefinedFrame.
  4. Combine 1. and 3. to do, e.g. "you performed a pivot, and then accidentally accessed a column that doesn't exist anymore"

I understand that polars does this at run time once it has the full schema of the data it's working on. But what if you could get all that information while still developing?

At the moment, I imagine you could get a crummy version of this experience by having a tool that creates a dummy LazyFrame/DataFrame with the schema of the FullyDefinedFrame, calls your functions on it, and gives you the results.

Is this possible in general? And if so, what would it take to make it work?

asked Oct 30 '25 00:10 by natemcintosh



1 Answer

The closest I have come so far is to write my functions against LazyFrames, and then write a test that calls each function with a LazyFrame that has the correct schema but no data. This is all based on the documentation on type checking in the lazy API.

# example.py
from datetime import date
import polars as pl


def my_fn(lf: pl.LazyFrame) -> pl.LazyFrame:
    """
    Expects the schema
    Schema({'date': pl.Date, 'employee_id': pl.Int32, 'value': pl.Float64})

    Performs a filter, then a group_by, and sums the values.
    """
    return (
        lf.filter(pl.col("date") == date(2025, 1, 1))
        .group_by(["date", "employee_id"])
        .agg(pl.col("value").sum().alias("total_value"))
    )

and a test file

# test_my_fn.py
from example import my_fn
import polars as pl


def test_my_types_and_schema():
    # The schema for the input LazyFrame
    input_schema = pl.Schema(
        [
            ("date", pl.Date),
            ("employee_id", pl.Int32),
            ("value", pl.Float64),
        ]
    )

    # Create a LazyFrame with no data, but with the correct schema
    lf = pl.LazyFrame(schema=input_schema)

    # Call the function
    out = my_fn(lf)

    # This will raise an error if the type checking fails
    out.collect()

    # If we also know what the output schema should be, define it here, and
    # compare with the output schema of the function.
    expected_schema = pl.Schema(
        [
            ("date", pl.Date),
            ("employee_id", pl.Int32),
            ("total_value", pl.Float64),
        ]
    )
    assert expected_schema == out.collect_schema()

Then run the test to make sure all the operations are type safe, and the schema matches the expected schema.

If, for example, we change the date column in the input schema in the test to be a string ...

    input_schema = pl.Schema(
        [
            ("date", pl.String),
            ("employee_id", pl.Int32),
            ("value", pl.Float64),
        ]
    )

and re-run the test, we get the error:

E       polars.exceptions.InvalidOperationError: cannot compare 'date/datetime/time' to a string value (create native python { 'date', 'datetime', 'time' } or compare to a temporal column)
E       
E       Resolved plan until failure:
E       
E               ---> FAILED HERE RESOLVING 'group_by' <---
E       FILTER [(col("date")) == (2025-01-01)] FROM
E         DF ["date", "employee_id", "value"]; PROJECT */3 COLUMNS

This works pretty well, but it is definitely more work-intensive than I had first hoped. Side effect is that it is nice for ensuring the

answered Oct 31 '25 16:10 by natemcintosh
