I have a Pandas DataFrame with conditional column A and numeric column B.
A B
1 'foo' 1.2
2 'bar' 1.3
3 'foo' 2.2
I also have a Python dictionary that defines ranges of B which denote "success" given each value of A.
mydict = {'foo': [1, 2], 'bar': [2, 3]}
I want to make a new column, 'error', in the dataframe. It should describe how far outside of the acceptable bounds for A the value of B falls. If A is within the range, the value should be zero.
A B error
1 'foo' 1.2 0
2 'bar' 1.3 -0.7
3 'foo' 2.2 0.2
I'm not a complete Pandas/Numpy newbie, and I'm halfway decent at Python, but this proved somewhat difficult. I don't want to do it with iterrows(), since I understand that's computationally expensive and this is going to get called a lot.
I eventually figured out a solution by combining lambda functions, pandas.DataFrame.map(), and nested numpy.where()s with given values for the optional x and y inputs.
getmin = lambda x: mydict[x][0]
getmax = lambda x: mydict[x][1]
df['error'] = np.where(df.B < dtfr.A.map(getmin),
df.B - df.A.map(getmin),
np.where(df.B > df.A.map(getmax),
df.B - df.A.map(getmax),
0
)
)
It works, but this can't possibly be the best way to do this, right? I feel like I'm abusing numpy.where() to get around not knowing how to map values from multiple columns of a dataframe to a lambda function in a non-iterative way. (Also to avoid writing mildly gnarly lambda functions).
Kind of three questions, I guess.
For your question about how to map multiple columns, you do it with
DataFrame.apply( , axis =1)
For your question I don't think you need this, but I think it's clearer if you do your calculation in a few steps:
df['low'] = df.A.map(lambda x: mydict[x][0])
df['high'] = df.A.map(lambda x: mydict[x][1])
df['error'] = np.maximum(df.B - df.high, 0) + np.minimum(df.B - df.low, 0)
df
A B low high error
1 foo 1.2 1 2 0.0
2 bar 1.3 2 3 -0.7
3 foo 2.2 1 2 0.2
I believe the code below is arguably more readable.
df['min'] = df.A.apply(lambda x: min(mydict[x]))
df['max'] = df.A.apply(lambda x: max(mydict[x]))
df['error'] = 0.
df.loc[df.B.gt(df['max']), 'error'] = df.B - df['max']
df.loc[df.B.lt(df['min']), 'error'] = df.B - df['min']
df.drop(['min', 'max'], axis=1, inplace=True)
>>> df
A B error
1 foo 1.2 0.0
2 bar 1.3 -0.7
3 foo 2.2 0.2
I don't see why you couldn't use numpy.where() for triconditional array operations, but you do sacrifice simplicity.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With