Polars UDF - returning and concatenating dataframes

I have a polars dataframe that contains arguments to functions.

import polars as pl

df = pl.DataFrame(
    {
        "foo": [1, 2, 3],
        "bar": [6.0, 7.0, 8.0],
        "ham": ["a", "b", "c"],
    }
)
shape: (3, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ str │
╞═════╪═════╪═════╡
│ 1   ┆ 6.0 ┆ a   │
│ 2   ┆ 7.0 ┆ b   │
│ 3   ┆ 8.0 ┆ c   │
└─────┴─────┴─────┘

I want to apply a UDF to each row that performs a calculation and then returns a dataframe for that row. Each returned dataframe has the same schema, but a varying number of rows. The result should be a single dataframe.

As a simplified dummy example, I tried this:

def myUDF(row_tuple):
    foo, bar, ham = row_tuple
    result = pl.DataFrame({
        "a": foo + bar,
        "b": ham
    })

    return (result,)

df.map_rows(myUDF)

shape: (3, 1)
┌────────────────┐
│ column_0       │
│ ---            │
│ object         │
╞════════════════╡
│ shape: (1, 2)  │
│ ┌─────┬─────┐  │
│ │ a …          │
│ shape: (1, 2)  │
│ ┌─────┬─────┐  │
│ │ a …          │
│ shape: (1, 2)  │
│ ┌──────┬─────┐ │
│ │ a…           │
└────────────────┘

The following seems to work, but it means converting everything to a Python dictionary first, and I'm worried about the performance of that.

def myUDF(row_tuple):
    foo, bar, ham = row_tuple
    result = pl.DataFrame({
        "a": foo + bar,
        "b": ham
    })

    return (result.to_dict(),)

df.map_rows(myUDF).unnest("column_0").explode("a", "b")

shape: (3, 2)
┌──────┬─────┐
│ a    ┆ b   │
│ ---  ┆ --- │
│ f64  ┆ str │
╞══════╪═════╡
│ 7.0  ┆ a   │
│ 9.0  ┆ b   │
│ 11.0 ┆ c   │
└──────┴─────┘

What's the "correct" way of doing this in polars?

Follow-up information to respond to comments...

The actual operation is more involved, and I thought including it in detail would detract from the specific guidance I was looking for. However, I oversimplified my example too much, causing confusion - my apologies. The important part I should have emphasised above is: "...Each returned dataframe has the same schema, but a varying number of rows."

I hope it's okay, but I've reworked the example to more faithfully mirror the operation I am trying to perform.

reference_data = pl.DataFrame({
    "x": range(0, 10000000),
    "y": [chr(ord('a') + (i%26)) for i in range(0, 10000000)]
})

def myUDF(row_tuple):
    foo, bar, ham = row_tuple
    result = (
        reference_data
        .slice(foo, int(bar / 2))
        .with_columns(name=pl.lit(ham))
    )

    return (result.to_dict(),)

The result is:

df.map_rows(myUDF)

shape: (3, 1)
┌───────────────────────────────────┐
│ column_0                          │
│ ---                               │
│ struct[3]                         │
╞═══════════════════════════════════╡
│ {[1, 2, 3],["b", "c", "d"],["a",… │
│ {[2, 3, 4],["c", "d", "e"],["b",… │
│ {[3, 4, … 6],["d", "e", … "g"],[… │
└───────────────────────────────────┘

The result I would like is:

df.map_rows(myUDF).unnest("column_0").explode(pl.all())

shape: (10, 3)
┌─────┬─────┬──────┐
│ x   ┆ y   ┆ name │
│ --- ┆ --- ┆ ---  │
│ i64 ┆ str ┆ str  │
╞═════╪═════╪══════╡
│ 1   ┆ b   ┆ a    │
│ 2   ┆ c   ┆ a    │
│ 3   ┆ d   ┆ a    │
│ 2   ┆ c   ┆ b    │
│ 3   ┆ d   ┆ b    │
│ 4   ┆ e   ┆ b    │
│ 3   ┆ d   ┆ c    │
│ 4   ┆ e   ┆ c    │
│ 5   ┆ f   ┆ c    │
│ 6   ┆ g   ┆ c    │
└─────┴─────┴──────┘
polars_user asked Feb 28 '26 20:02

1 Answer

Of course the "right" way is to not use map_rows at all but, notwithstanding that, the function you give to map_rows should return a tuple, and map_rows will construct a new dataframe for you. You need not, and should not, construct a dataframe within the function.

Option 1

Keep using map_rows as intended and rename the columns manually:

def myUDF(row_tuple):
    foo, bar, ham = row_tuple
    return (foo + bar, ham)

(
    df
    .map_rows(myUDF)
    .rename({f"column_{x}":y for x,y in enumerate(["a","b"])})
)

Option 2

Keep the function returning a dataframe, but use concat with iter_rows:

def myUDF(row_tuple):
    foo, bar, ham = row_tuple
    return pl.DataFrame({
        "a": foo + bar,
        "b": ham
    })

pl.concat([
    myUDF(x) for x in df.iter_rows()
])

Option 3 (probably a red herring)

Similar to Option 2, but by making the return lazy, it can do the computations in parallel. This only works for polars computations; any Python execution will still be subject to the GIL and won't be parallel.

def myUDF(row_tuple):
    foo, bar, ham = row_tuple
    return (
        pl.select().lazy()
        .with_columns(a=pl.lit(foo) + pl.lit(bar), b=pl.lit(ham))
    )

pl.concat([
    myUDF(x) for x in df.iter_rows()
]).collect()

The last one is neat for the toy function because you get per-row parallel execution, but if your function can be made into expressions like this then you should just do it directly and not go through iter_rows or map_rows - that's why I called it a red herring. There are conceivably cases where row-wise parallelism is preferred to column-based, which is why I only called it probably a red herring.

Dean MacGregor answered Mar 03 '26 10:03

