Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Alternatives to nested numpy.where for multiconditional pandas operations?

I have a Pandas DataFrame with conditional column A and numeric column B.

    A    B
1 'foo' 1.2
2 'bar' 1.3
3 'foo' 2.2

I also have a Python dictionary that defines ranges of B which denote "success" given each value of A.

mydict = {'foo': [1, 2], 'bar': [2, 3]}

I want to make a new column, 'error', in the dataframe. It should describe how far outside of the acceptable bounds for A the value of B falls. If A is within the range, the value should be zero.

    A    B   error
1 'foo' 1.2   0
2 'bar' 1.3  -0.7
3 'foo' 2.2   0.2

I'm not a complete Pandas/Numpy newbie, and I'm halfway decent at Python, but this proved somewhat difficult. I don't want to do it with iterrows(), since I understand that's computationally expensive and this is going to get called a lot.

I eventually figured out a solution by combining lambda functions, pandas.DataFrame.map(), and nested numpy.where()s with given values for the optional x and y inputs.

getmin = lambda x: mydict[x][0]
getmax = lambda x: mydict[x][1] 
df['error'] = np.where(df.B < dtfr.A.map(getmin),
                       df.B - df.A.map(getmin),
                       np.where(df.B > df.A.map(getmax),
                                df.B - df.A.map(getmax),
                                0
                                )
                       )

It works, but this can't possibly be the best way to do this, right? I feel like I'm abusing numpy.where() to get around not knowing how to map values from multiple columns of a dataframe to a lambda function in a non-iterative way. (Also to avoid writing mildly gnarly lambda functions).

Kind of three questions, I guess.

  1. Is it OK to nest numpy.where()s for triconditional array operations?
  2. How can I non-iteratively map from two dataframe columns to one function?
  3. If 2) is possible and 1) is acceptable, which is preferable?
like image 650
Logan Davis Avatar asked Jun 11 '15 23:06

Logan Davis


2 Answers

For your question about how to map multiple columns, you do it with

DataFrame.apply( , axis =1)

For your question I don't think you need this, but I think it's clearer if you do your calculation in a few steps:

df['low'] = df.A.map(lambda x: mydict[x][0])
df['high'] = df.A.map(lambda x: mydict[x][1])
df['error'] = np.maximum(df.B - df.high, 0) + np.minimum(df.B - df.low, 0)
df
     A    B  low  high  error
1  foo  1.2    1     2    0.0
2  bar  1.3    2     3   -0.7
3  foo  2.2    1     2    0.2
like image 82
maxymoo Avatar answered Nov 14 '22 23:11

maxymoo


I believe the code below is arguably more readable.

df['min'] = df.A.apply(lambda x: min(mydict[x]))
df['max'] = df.A.apply(lambda x: max(mydict[x]))
df['error'] = 0.
df.loc[df.B.gt(df['max']), 'error'] = df.B - df['max']
df.loc[df.B.lt(df['min']), 'error'] = df.B - df['min']
df.drop(['min', 'max'], axis=1, inplace=True)
>>> df
     A    B  error
1  foo  1.2    0.0
2  bar  1.3   -0.7
3  foo  2.2    0.2

I don't see why you couldn't use numpy.where() for triconditional array operations, but you do sacrifice simplicity.

like image 45
Alexander Avatar answered Nov 15 '22 00:11

Alexander