I am trying to translate a pipeline of manipulations on a dataframe in R over to its Python equivalent. A basic example of the pipeline is as follows, incorporating a few mutate and filter calls:
library(tidyverse)
calc_circle_area <- function(diam) pi / 4 * diam^2
calc_cylinder_vol <- function(area, length) area * length
raw_data <- tibble(cylinder_name=c('a', 'b', 'c'), length=c(3, 5, 9), diam=c(1, 2, 4))
new_table <- raw_data %>%
  mutate(area = calc_circle_area(diam)) %>%
  mutate(vol = calc_cylinder_vol(area, length)) %>%
  mutate(is_small_vol = vol < 100) %>%
  filter(is_small_vol)
I can replicate this in pandas without too much trouble, but I find that it involves some nested lambda calls when using assign to do an apply (first where the dataframe caller is an argument, and subsequently with dataframe rows as the argument). This tends to obscure the meaning of the assign call, where I would like to specify something more to the point (like the R version) if at all possible.
import pandas as pd
import math
calc_circle_area = lambda diam: math.pi / 4 * diam**2
calc_cylinder_vol = lambda area, length: area * length
raw_data = pd.DataFrame({'cylinder_name': ['a', 'b', 'c'], 'length': [3, 5, 9], 'diam': [1, 2, 4]})
new_table = (
    raw_data
    .assign(area=lambda df: df.diam.apply(calc_circle_area))
    .assign(vol=lambda df: df.apply(lambda r: calc_cylinder_vol(r.area, r.length), axis=1))
    .assign(is_small_vol=lambda df: df.vol < 100)
    .loc[lambda df: df.is_small_vol]
)
I am aware that the .assign(area=lambda df: df.diam.apply(calc_circle_area)) could be written as .assign(area=raw_data.diam.apply(calc_circle_area)), but only because the diam column already exists in the original dataframe, which may not always be the case.
I also realize that the calc_... functions here are vectorizable, meaning I could also do things like
.assign(area=lambda df: calc_circle_area(df.diam))
.assign(vol=lambda df: calc_cylinder_vol(df.area, df.length))
but again, since most functions aren't vectorizable, this wouldn't work in most cases.
TL;DR: I am wondering if there is a cleaner way to "mutate" columns on a dataframe that doesn't involve double-nesting lambda statements, as in:
.assign(vol=lambda df: df.apply(lambda r: calc_cylinder_vol(r.area, r.length), axis=1))
Are there best practices for this type of application or is this the best one can do within the context of method chaining?
The best practice is to vectorize operations.
The reason for this is performance, because apply is very slow. You are already taking advantage of vectorization in the R code, and you should continue to do so in Python. You will find that, because of this performance consideration, most of the functions you need actually are vectorizable.
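For the example above, a fully vectorized version of the whole chain might look like this (just the vectorized assign calls you already noted, strung together):
new_table = (
    raw_data
    # both helpers are plain arithmetic, so they accept whole Series directly
    .assign(area=lambda df: calc_circle_area(df.diam))
    .assign(vol=lambda df: calc_cylinder_vol(df.area, df.length))
    .assign(is_small_vol=lambda df: df.vol < 100)
    .loc[lambda df: df.is_small_vol]
)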
That will get rid of your inner lambdas. For the outer lambdas over the df, I think what you have is the cleanest pattern. The alternative is to repeatedly reassign to the raw_data variable, or some other intermediate variable(s), but this doesn't fit the method chaining style for which you are asking.
There are also Python packages like dfply that aim to mimic the dplyr feel in Python. These do not receive the same level of support as core pandas does, so keep that in mind if you want to go this route.
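For reference, a dfply version of this pipeline could look roughly like the sketch below. I have not tested it here, and the exact verbs (mutate, mask for filtering, the X placeholder) are assumptions about the dfply API, so treat it as an illustration of the style rather than working code.
from dfply import X, mutate, mask

new_table = (
    raw_data
    # X builds deferred column expressions, so plain arithmetic works;
    # X.diam * X.diam * (math.pi / 4) is the same pi/4 * diam^2 as above
    >> mutate(area=X.diam * X.diam * (math.pi / 4))
    >> mutate(vol=X.area * X.length)
    >> mutate(is_small_vol=X.vol < 100)
    >> mask(X.is_small_vol)
)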
Or, if you just want to save a bit of typing, and all the functions operate only on whole columns, you can create a glue function that unpacks the columns for you and passes them along.
def df_apply(col_fn, *col_names):
    # returns a function of the dataframe that looks up the named columns
    # and passes them (as Series) to col_fn
    def inner_fn(df):
        cols = [df[col] for col in col_names]
        return col_fn(*cols)
    return inner_fn
Then usage ends up looking something like this:
new_table = (
    raw_data
    .assign(area=df_apply(calc_circle_area, 'diam'))
    .assign(vol=df_apply(calc_cylinder_vol, 'area', 'length'))
    .assign(is_small_vol=lambda df: df.vol < 100)
    .loc[lambda df: df.is_small_vol]
)
It is also possible to write this without taking advantage of vectorization, in case that does come up.
def df_apply_unvec(fn, *col_names):
    # row-wise version: calls fn once per row, with that row's column values
    def inner_fn(df):
        def row_fn(row):
            vals = [row[col] for col in col_names]
            return fn(*vals)
        return df.apply(row_fn, axis=1)
    return inner_fn
I used named functions for extra clarity. But it can be condensed with lambdas into something that looks much like your original format, just generic.
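For instance, a condensed lambda-only form of df_apply_unvec might look something like this sketch (same behavior, just packed into nested lambdas):
# condensed equivalent of df_apply_unvec: the outer lambda captures the function
# and column names, the inner lambda does the row-wise apply
df_apply_unvec = lambda fn, *col_names: (
    lambda df: df.apply(lambda row: fn(*(row[col] for col in col_names)), axis=1)
)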
As @mcskinner has pointed out, vectorized operations are better and faster. If, however, your operation cannot be vectorized and you still want to apply a function, you could use the pipe method, which allows for cleaner method chaining:
import math
def area(df):
    # assigns the new column in place on the dataframe passed in
    df['area'] = math.pi / 4 * df['diam'] ** 2
    return df

def vol(df):
    df['vol'] = df['area'] * df['length']
    return df
new_table = (
    raw_data
    .pipe(area)
    .pipe(vol)
    .assign(is_small_vol=lambda df: df.vol < 100)
    .loc[lambda df: df.is_small_vol]
)
new_table
  cylinder_name  length  diam      area        vol  is_small_vol
0             a       3     1  0.785398   2.356194          True
1             b       5     2  3.141593  15.707963          True
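Worth noting: because area and vol assign columns in place, piping raw_data through them also modifies raw_data itself. If you want to keep the original dataframe untouched, one possible variant (a sketch that returns new dataframes via assign instead of writing columns in place) is:
def area(df):
    # assign returns a new dataframe, leaving the input (and raw_data) unchanged
    return df.assign(area=math.pi / 4 * df['diam'] ** 2)

def vol(df):
    return df.assign(vol=df['area'] * df['length'])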