I'm trying to figure out how to apply a lambda function across multiple dataframes at once, without first merging them. I am working with large datasets (>60MM records), so I need to be extra careful with memory management.
My hope is that there is a way to apply the lambda directly to the underlying dataframes. That would let me avoid the cost of stitching them together and then dropping the intermediate dataframe from memory before moving on to the next step in the process.
I have experience dodging out-of-memory issues by using HDF5-based dataframes, but I'd rather explore something different first.
I have provided a toy problem to help demonstrate what I am talking about.
import numpy as np
import pandas as pd
# Here's an arbitrary function to use with lambda
def someFunction(input1, input2, input3, input4):
    theSum = input1 + input2
    theAverage = (input1 + input2 + input3 + input4) / 4
    theProduct = input2 * input3 * input4
    return pd.Series({'Sum': theSum, 'Average': theAverage, 'Product': theProduct})
# Cook up some dummy dataframes
df1 = pd.DataFrame(np.random.randn(6,2),columns=list('AB'))
df2 = pd.DataFrame(np.random.randn(6,1),columns=list('C'))
df3 = pd.DataFrame(np.random.randn(6,1),columns=list('D'))
# Currently, I merge the dataframes together and then apply the lambda function
dfConsolodated = pd.concat([df1, df2, df3], axis=1)
# This works just fine, but merging the dataframes seems like an extra step
dfResults = dfConsolodated.apply(lambda x: someFunction(x['A'], x['B'], x['C'], x['D']), axis = 1)
# I want to avoid the concat completely in order to be more efficient with memory. I am hoping for something like this:
# I am COMPLETELY making this syntax up for conceptual purposes, my apologies.
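# (Left commented out because a Python list has no .apply method, so this line would raise an AttributeError.)
# dfResultsWithoutConcat = [df1, df2, df3].apply(lambda x: someFunction(df1['A'], df1['B'], df2['C'], df3['D']), axis = 1)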
I know this question is kind of old, but here is a way I came up with. It is not nice, but it works.
The basic idea is to query the second dataframe inside the applied function. By using the name of the passed series, you can identify the column/index label and use it to retrieve the needed value from the other dataframe(s).
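def func(x, other):
    # x.name is the index label of the row currently being processed
    other_value = other.loc[x.name]
    return your_actual_method(x, other_value)

# axis=1 passes each row of df1, so x.name matches a row label in df2
result = df1.apply(lambda x: func(x, df2), axis=1)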
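For what it's worth, here is a minimal sketch of how that idea might look on the toy problem from the question. It assumes all three dataframes share the same index, which holds for the dummy data but may not for real data. The helper name row_func and the seeded generator are made up for illustration. The last part is a different technique, not the lookup approach: since someFunction only does arithmetic, plain column operations on the index-aligned Series seem to give the same numbers with neither concat nor apply.

```python
import numpy as np
import pandas as pd

# Same arbitrary function as in the question
def someFunction(input1, input2, input3, input4):
    theSum = input1 + input2
    theAverage = (input1 + input2 + input3 + input4) / 4
    theProduct = input2 * input3 * input4
    return pd.Series({'Sum': theSum, 'Average': theAverage, 'Product': theProduct})

# Dummy dataframes (seeded here only so the example is repeatable)
rng = np.random.default_rng(0)
df1 = pd.DataFrame(rng.standard_normal((6, 2)), columns=list('AB'))
df2 = pd.DataFrame(rng.standard_normal((6, 1)), columns=list('C'))
df3 = pd.DataFrame(rng.standard_normal((6, 1)), columns=list('D'))

# Hypothetical helper: look up the matching values in the other
# dataframes by the index label of the current df1 row.
# Assumption: df2 and df3 contain every index label found in df1.
def row_func(row, other_c, other_d):
    c = other_c.at[row.name, 'C']
    d = other_d.at[row.name, 'D']
    return someFunction(row['A'], row['B'], c, d)

# Row-wise apply on df1 only; no concatenated dataframe is built
dfResultsWithoutConcat = df1.apply(lambda row: row_func(row, df2, df3), axis=1)

# Alternative technique (not the lookup approach above): operate on whole
# columns. pandas aligns the Series on their index, so no concat is needed.
dfVectorized = pd.DataFrame({
    'Sum': df1['A'] + df1['B'],
    'Average': (df1['A'] + df1['B'] + df2['C'] + df3['D']) / 4,
    'Product': df1['B'] * df2['C'] * df3['D'],
})
```

The row-wise version still calls a Python function and does a label lookup for every row, so on tens of millions of records it is likely to be slow even though it skips the concat. If the real function can be written as column arithmetic, the second form is probably the one to try first.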