I have a polars dataframe that contains arguments to functions.
import polars as pl

df = pl.DataFrame(
    {
        "foo": [1, 2, 3],
        "bar": [6.0, 7.0, 8.0],
        "ham": ["a", "b", "c"],
    }
)
shape: (3, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ str │
╞═════╪═════╪═════╡
│ 1 ┆ 6.0 ┆ a │
│ 2 ┆ 7.0 ┆ b │
│ 3 ┆ 8.0 ┆ c │
└─────┴─────┴─────┘
I want to apply a UDF to each row that performs a calculation and returns a dataframe. Each returned dataframe has the same schema, but a varying number of rows. The result should be a single dataframe.
As a simplified dummy example, I tried this:
def myUDF(row_tuple):
    foo, bar, ham = row_tuple
    result = pl.DataFrame({
        "a": foo + bar,
        "b": ham
    })
    return (result,)

df.map_rows(myUDF)
shape: (3, 1)
┌────────────────┐
│ column_0 │
│ --- │
│ object │
╞════════════════╡
│ shape: (1, 2) │
│ ┌─────┬─────┐ │
│ │ a … │
│ shape: (1, 2) │
│ ┌─────┬─────┐ │
│ │ a … │
│ shape: (1, 2) │
│ ┌──────┬─────┐ │
│ │ a… │
└────────────────┘
This seems to work, but it means round-tripping everything through a Python dictionary, and I'm worried about performance.
def myUDF(row_tuple):
    foo, bar, ham = row_tuple
    result = pl.DataFrame({
        "a": foo + bar,
        "b": ham
    })
    return (result.to_dict(),)

df.map_rows(myUDF).unnest("column_0").explode("a", "b")
shape: (3, 2)
┌──────┬─────┐
│ a ┆ b │
│ --- ┆ --- │
│ f64 ┆ str │
╞══════╪═════╡
│ 7.0 ┆ a │
│ 9.0 ┆ b │
│ 11.0 ┆ c │
└──────┴─────┘
What's the "correct" way of doing this in polars?
The actual operation is more involved, and I thought including it in detail would detract from the specific guidance I was looking for. However, I oversimplified my example too much, causing confusion; my apologies. The important part I should have emphasised above is: "...Each returned dataframe has the same schema, but a varying number of rows."
I hope it's okay, but I've reworked the example to more faithfully mirror the operation I am trying to perform.
reference_data = pl.DataFrame({
    "x": range(0, 10000000),
    "y": [chr(ord('a') + (i % 26)) for i in range(0, 10000000)]
})

def myUDF(row_tuple):
    foo, bar, ham = row_tuple
    result = (
        reference_data
        .slice(foo, int(bar / 2))
        .with_columns(name=pl.lit(ham))
    )
    return (result.to_dict(),)
The result is:
df.map_rows(myUDF)
shape: (3, 1)
┌───────────────────────────────────┐
│ column_0 │
│ --- │
│ struct[3] │
╞═══════════════════════════════════╡
│ {[1, 2, 3],["b", "c", "d"],["a",… │
│ {[2, 3, 4],["c", "d", "e"],["b",… │
│ {[3, 4, … 6],["d", "e", … "g"],[… │
└───────────────────────────────────┘
The result I would like is:
df.map_rows(myUDF).unnest("column_0").explode(pl.all())
shape: (10, 3)
┌─────┬─────┬──────┐
│ x ┆ y ┆ name │
│ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str │
╞═════╪═════╪══════╡
│ 1 ┆ b ┆ a │
│ 2 ┆ c ┆ a │
│ 3 ┆ d ┆ a │
│ 2 ┆ c ┆ b │
│ 3 ┆ d ┆ b │
│ 4 ┆ e ┆ b │
│ 3 ┆ d ┆ c │
│ 4 ┆ e ┆ c │
│ 5 ┆ f ┆ c │
│ 6 ┆ g ┆ c │
└─────┴─────┴──────┘
Of course the "right" way is to not use map_rows at all but, notwithstanding that, the function you give to map_rows should return a tuple, and it will construct a new df for you. You need not, and should not, construct a df within the function.
Option 1: keep with map_rows as it's intended and rename the columns manually.
def myUDF(row_tuple):
    foo, bar, ham = row_tuple
    return (foo + bar, ham)

(
    df
    .map_rows(myUDF)
    .rename({f"column_{x}": y for x, y in enumerate(["a", "b"])})
)
Option 2: keep the function returning a df, but use concat and iter_rows.
def myUDF(row_tuple):
    foo, bar, ham = row_tuple
    return pl.DataFrame({
        "a": foo + bar,
        "b": ham
    })

pl.concat([
    myUDF(x) for x in df.iter_rows()
])
Option 3: similar to option 2, but by making the return lazy, the computations run in parallel. This only works for polars computations; any Python execution is still subject to the GIL and won't be parallel.
def myUDF(row_tuple):
    foo, bar, ham = row_tuple
    return (
        pl.select().lazy()
        .with_columns(a=pl.lit(foo) + pl.lit(bar), b=pl.lit(ham))
    )

pl.concat([
    myUDF(x) for x in df.iter_rows()
]).collect()
The last one is neat for the toy function because you get per-row parallel execution. But if your function can be made into expressions like this, you should just use them directly and not go through iter_rows or map_rows at all; that's why I called map_rows a red herring. There are conceivably cases where row-wise parallelism is preferred to column-based, which is why I only said it was probably a red herring.