
How to avoid excessive lambda functions in pandas DataFrame assign and apply method chains

I am trying to translate a pipeline of manipulations on a dataframe in R over to its Python equivalent. A basic example of the pipeline is as follows, incorporating a few mutate and filter calls:

library(tidyverse)

calc_circle_area <- function(diam) pi / 4 * diam^2
calc_cylinder_vol <- function(area, length) area * length

raw_data <- tibble(cylinder_name=c('a', 'b', 'c'), length=c(3, 5, 9), diam=c(1, 2, 4))

new_table <- raw_data %>% 
  mutate(area = calc_circle_area(diam)) %>% 
  mutate(vol = calc_cylinder_vol(area, length)) %>% 
  mutate(is_small_vol = vol < 100) %>% 
  filter(is_small_vol)

I can replicate this in pandas without too much trouble but find that it involves some nested lambda calls when using assign to do an apply (first where the dataframe caller is an argument, and subsequently with dataframe rows as the argument). This tends to obscure the meaning of the assign call, where I would like to specify something more to the point (like the R version) if at all possible.

import pandas as pd
import math

calc_circle_area = lambda diam: math.pi / 4 * diam**2
calc_cylinder_vol = lambda area, length: area * length

raw_data = pd.DataFrame({'cylinder_name': ['a', 'b', 'c'], 'length': [3, 5, 9], 'diam': [1, 2, 4]})

new_table = (
    raw_data
        .assign(area=lambda df: df.diam.apply(calc_circle_area))
        .assign(vol=lambda df: df.apply(lambda r: calc_cylinder_vol(r.area, r.length), axis=1))
        .assign(is_small_vol=lambda df: df.vol < 100)
        .loc[lambda df: df.is_small_vol]
)

I am aware that the .assign(area=lambda df: df.diam.apply(calc_circle_area)) could be written as .assign(area=raw_data.diam.apply(calc_circle_area)) but only because the diam column already exists in the original dataframe, which may not always be the case.

I also realize that the calc_... functions here are vectorizable, meaning I could also do things like

.assign(area=lambda df: calc_circle_area(df.diam))
.assign(vol=lambda df: calc_cylinder_vol(df.area, df.length))

but again, since most functions aren't vectorizable, this wouldn't work in most cases.

TL;DR I am wondering if there is a cleaner way to "mutate" columns on a dataframe that doesn't involve double-nesting lambda statements, like in something like:

.assign(vol=lambda df: df.apply(lambda r: calc_cylinder_vol(r.area, r.length), axis=1))

Are there best practices for this type of application or is this the best one can do within the context of method chaining?

asked Apr 16 '20 05:04 by teepee



2 Answers

The best practice is to vectorize operations.

The reason for this is performance: apply with axis=1 calls a Python function once per row, which is far slower than operating on whole columns at once. You are already taking advantage of vectorization in the R code, and you should continue to do so in Python. You will find that, because of this performance consideration, most of the functions you need actually are vectorizable.

That will get rid of your inner lambdas. For the outer lambdas over the df, I think what you have is the cleanest pattern. The alternative is to repeatedly reassign to the raw_data variable, or some other intermediate variable(s), but this doesn't fit the method chaining style you are asking for.
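As a rough sketch of the vectorized chain: the calc_ functions are written with def and numpy here (an assumption for illustration, the question uses lambdas and math), and the three columns go into a single assign call. In pandas 0.23+ on Python 3.6+, later keyword arguments to assign may refer to columns created earlier in the same call.

```python
import numpy as np
import pandas as pd

def calc_circle_area(diam):
    # Plain arithmetic, so it accepts a scalar or a whole Series
    return np.pi / 4 * diam**2

def calc_cylinder_vol(area, length):
    return area * length

raw_data = pd.DataFrame({
    'cylinder_name': ['a', 'b', 'c'],
    'length': [3, 5, 9],
    'diam': [1, 2, 4],
})

new_table = (
    raw_data
        .assign(
            # Each keyword is evaluated in order against the frame built so far
            area=lambda df: calc_circle_area(df.diam),
            vol=lambda df: calc_cylinder_vol(df.area, df.length),
            is_small_vol=lambda df: df.vol < 100,
        )
        .loc[lambda df: df.is_small_vol]
)
```

Only the outer lambdas over df remain, and raw_data itself is left unchanged.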

There are also Python packages like dfply that aim to mimic the dplyr feel in Python. These do not receive the same level of support as core pandas will, so keep that in mind if you want to go this route.


Or, if you want to just save a bit of typing, and all the functions will be only over columns, you can create a glue function that unpacks the columns for you and passes them along.

def df_apply(col_fn, *col_names):
    def inner_fn(df):
        cols = [df[col] for col in col_names]
        return col_fn(*cols)
    return inner_fn

Then usage ends up looking something like this:

new_table = (
    raw_data
        .assign(area=df_apply(calc_circle_area, 'diam'))
        .assign(vol=df_apply(calc_cylinder_vol, 'area', 'length'))
        .assign(is_small_vol=lambda df: df.vol < 100)
        .loc[lambda df: df.is_small_vol]
)

It is also possible to write a row-wise version for functions that cannot be vectorized, in case that does come up.

def df_apply_unvec(fn, *col_names):
    def inner_fn(df):
        def row_fn(row):
            vals = [row[col] for col in col_names]
            return fn(*vals)
        return df.apply(row_fn, axis=1)
    return inner_fn

I used named functions for extra clarity. But it can be condensed with lambdas into something that looks much like your original format, just generic.
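The condensed form mentioned above might look like the following sketch. size_label and the vol_class column are hypothetical names added only to show a function that really is scalar-only (its `if` raises on a whole Series); they are not part of the question.

```python
import math
import pandas as pd

# Condensed equivalent of df_apply_unvec: one lambda over the frame,
# one lambda over each row
def df_apply_unvec(fn, *col_names):
    return lambda df: df.apply(
        lambda row: fn(*(row[col] for col in col_names)), axis=1
    )

# Hypothetical scalar-only function: the conditional cannot take a Series
def size_label(vol):
    return 'small' if vol < 100 else 'large'

raw_data = pd.DataFrame({
    'cylinder_name': ['a', 'b', 'c'],
    'length': [3, 5, 9],
    'diam': [1, 2, 4],
})

labelled = (
    raw_data
        .assign(area=lambda df: math.pi / 4 * df.diam**2)
        .assign(vol=lambda df: df.area * df.length)
        # Row-wise fallback, used only where vectorizing is not possible
        .assign(vol_class=df_apply_unvec(size_label, 'vol'))
)
```

The call site stays as short as the vectorized df_apply version, so the two helpers can be mixed freely in one chain.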

answered Oct 19 '22 03:10 by mcskinner


As @mcskinner has pointed out, vectorized operations are better and faster. If, however, your operation cannot be vectorized and you still want to apply a function, you could use the pipe method, which allows for cleaner method chaining:

import math

def area(df):
    # Return a new frame rather than writing into raw_data in place
    return df.assign(area=math.pi / 4 * df['diam']**2)

def vol(df):
    return df.assign(vol=df['area'] * df['length'])

new_table = (
    raw_data
        .pipe(area)
        .pipe(vol)
        .assign(is_small_vol=lambda df: df.vol < 100)
        .loc[lambda df: df.is_small_vol]
)

new_table

  cylinder_name  length  diam      area        vol  is_small_vol
0             a       3     1  0.785398   2.356194          True
1             b       5     2  3.141593  15.707963          True
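
The functions in the pipe example happen to be vectorized, so here is a sketch of how pipe could carry a function that only works on scalars. pipe forwards extra positional arguments to the function it calls, which lets one generic helper serve every column. add_rowwise is a hypothetical name, not a pandas function, and the def versions of the calc_ functions are assumed for illustration.

```python
import math
import pandas as pd

def calc_circle_area(diam):
    return math.pi / 4 * diam**2

def calc_cylinder_vol(area, length):
    return area * length

def add_rowwise(df, name, fn, *col_names):
    """Return a copy of df with column `name` computed one row at a time."""
    # zip walks the chosen columns in parallel, yielding one tuple per row
    values = [fn(*vals) for vals in zip(*(df[col] for col in col_names))]
    return df.assign(**{name: values})

raw_data = pd.DataFrame({
    'cylinder_name': ['a', 'b', 'c'],
    'length': [3, 5, 9],
    'diam': [1, 2, 4],
})

new_table = (
    raw_data
        .pipe(add_rowwise, 'area', calc_circle_area, 'diam')
        .pipe(add_rowwise, 'vol', calc_cylinder_vol, 'area', 'length')
        .loc[lambda df: df.vol < 100]
)
```

Because the helper builds its result with assign, raw_data is not modified along the way.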
answered Oct 19 '22 04:10 by sammywemmy