I have a polars dataframe that contains arguments to functions.
import polars as pl

df = pl.DataFrame(
    {
        "foo": [1, 2, 3],
        "bar": [6.0, 7.0, 8.0],
        "ham": ["a", "b", "c"],
    }
)
shape: (3, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ str │
╞═════╪═════╪═════╡
│ 1 ┆ 6.0 ┆ a │
│ 2 ┆ 7.0 ┆ b │
│ 3 ┆ 8.0 ┆ c │
└─────┴─────┴─────┘
I want to apply a UDF to each row that performs a calculation and returns a dataframe. Each returned dataframe has the same schema, but a varying number of rows. The result should be a single dataframe.
As a simplified dummy example, I tried this:
def myUDF(row_tuple):
    foo, bar, ham = row_tuple
    result = pl.DataFrame({
        "a": foo + bar,
        "b": ham
    })
    return (result,)

df.map_rows(myUDF)
shape: (3, 1)
┌────────────────┐
│ column_0 │
│ --- │
│ object │
╞════════════════╡
│ shape: (1, 2) │
│ ┌─────┬─────┐ │
│ │ a … │
│ shape: (1, 2) │
│ ┌─────┬─────┐ │
│ │ a … │
│ shape: (1, 2) │
│ ┌──────┬─────┐ │
│ │ a… │
└────────────────┘
This seems to work, but it means round-tripping everything through a Python dictionary, and I'm worried about performance.
def myUDF(row_tuple):
    foo, bar, ham = row_tuple
    result = pl.DataFrame({
        "a": foo + bar,
        "b": ham
    })
    return (result.to_dict(),)

df.map_rows(myUDF).unnest("column_0").explode("a", "b")
shape: (3, 2)
┌──────┬─────┐
│ a ┆ b │
│ --- ┆ --- │
│ f64 ┆ str │
╞══════╪═════╡
│ 7.0 ┆ a │
│ 9.0 ┆ b │
│ 11.0 ┆ c │
└──────┴─────┘
What's the "correct" way of doing this in polars?
The actual operation is more involved, and I thought including it in detail would detract from the specific guidance I was looking for. However, I oversimplified my example too much, causing confusion; my apologies. The important part I should have emphasised above is: "...Each returned dataframe has the same schema, but a varying number of rows."
I hope it's okay, but I've reworked the example to more faithfully mirror the operation I am trying to perform.
reference_data = pl.DataFrame({
    "x": range(0, 10000000),
    "y": [chr(ord('a') + (i % 26)) for i in range(0, 10000000)]
})

def myUDF(row_tuple):
    foo, bar, ham = row_tuple
    result = (
        reference_data
        .slice(foo, int(bar / 2))
        .with_columns(name=pl.lit(ham))
    )
    return (result.to_dict(),)
The result is:
df.map_rows(myUDF)
shape: (3, 1)
┌───────────────────────────────────┐
│ column_0 │
│ --- │
│ struct[3] │
╞═══════════════════════════════════╡
│ {[1, 2, 3],["b", "c", "d"],["a",… │
│ {[2, 3, 4],["c", "d", "e"],["b",… │
│ {[3, 4, … 6],["d", "e", … "g"],[… │
└───────────────────────────────────┘
The result I would like is:
df.map_rows(myUDF).unnest("column_0").explode(pl.all())
shape: (10, 3)
┌─────┬─────┬──────┐
│ x ┆ y ┆ name │
│ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str │
╞═════╪═════╪══════╡
│ 1 ┆ b ┆ a │
│ 2 ┆ c ┆ a │
│ 3 ┆ d ┆ a │
│ 2 ┆ c ┆ b │
│ 3 ┆ d ┆ b │
│ 4 ┆ e ┆ b │
│ 3 ┆ d ┆ c │
│ 4 ┆ e ┆ c │
│ 5 ┆ f ┆ c │
│ 6 ┆ g ┆ c │
└─────┴─────┴──────┘
Of course the "right" way is to not use map_rows at all but, notwithstanding that, the function you give to map_rows should return a tuple, and it will construct a new df for you. You need not, and should not, construct a df within the function.
Option 1: keep with map_rows as it's intended and rename the columns manually.
def myUDF(row_tuple):
    foo, bar, ham = row_tuple
    return (foo + bar, ham)

(
    df
    .map_rows(myUDF)
    .rename({f"column_{x}": y for x, y in enumerate(["a", "b"])})
)
Option 2: keep the function returning a df, but use concat and iter_rows.
def myUDF(row_tuple):
    foo, bar, ham = row_tuple
    return pl.DataFrame({
        "a": foo + bar,
        "b": ham
    })

pl.concat([
    myUDF(x) for x in df.iter_rows()
])
Option 3: similar to option 2, but by making the return lazy, the computations run in parallel. This only works for polars computations; any Python execution is still subject to the GIL and won't be parallel.
def myUDF(row_tuple):
    foo, bar, ham = row_tuple
    return (
        pl.select().lazy()
        .with_columns(a=pl.lit(foo) + pl.lit(bar), b=pl.lit(ham))
    )

pl.concat([
    myUDF(x) for x in df.iter_rows()
]).collect()
The last one is neat for the toy function because you get per-row parallel execution. But if your function can be made into expressions like this, you should just use them directly and not go through iter_rows or map_rows at all; that's why I called map_rows a red herring. There are conceivably cases where row-wise parallelism is preferred to column-based, which is why I only said it was probably a red herring.