Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Create complicated conditional column (geometric mean) Python

Tags:

python

pandas

I'm trying to create a column in a df that will tell me whether or not water at a given time was of swimmable quality. I'm hitting a wall though with geometric mean.

This is a mini sample df like what I'm working with:

df = pd.DataFrame({'Site': ['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'C'], 'EnteroCount': [1733, 4, 20, 150, 70, 1500, 55, 22, 180]})
df["Site"] = df["Site"].astype('category')

These are the conditions that make water unacceptable for swimming:
EnteroCount >= 110
OR
five or more samples with a geometric mean (a weighted average) greater than 30

I want to create a column that just assigns a value "acceptable" or "unacceptable" based on both of these conditions. I can do the following for the first condition, but bringing in the gmean isn't so easy (or is it?):

df['swim'] = np.where(df['EnteroCount']>=110, 'unacceptable', 'acceptable')

Would love to see how this can be done efficiently.

like image 680
Daina Avatar asked Mar 13 '23 17:03

Daina


1 Answers

This calculates the geometric mean of each site and checks if it is greater than 30:

>>> df['geo_mean_acceptable'] = (
        df.groupby('Site')
          .transform(lambda group: group.prod() ** (1 / float(len(group))) > 30)
          .astype(bool))

And this gets the geometric mean of each site:

>>> df.groupby('Site').EnteroCount.apply(lambda group: group.product() ** (1 / float(len(group))))
Site
A     68.016702
B    121.981006
C    180.000000
Name: EnteroCount, dtype: float64

Using the geometric mean function from scipy:

from scipy.stats.mstats import gmean

>>> df.groupby('Site').EnteroCount.apply(gmean)
Site
A     68.016702
B    121.981006
C    180.000000
Name: EnteroCount, dtype: float64

Given that the five highest values will give you the highest geometric mean in a group, you can use this:

df.groupby('Site').EnteroCount.apply(lambda group: gmean(group.nlargest(5)))

You can see how it is selecting the largest five values by group, which then get used as parameters for gmean:

>>> df.groupby('Site').EnteroCount.apply(lambda group: group.nlargest(5).values.tolist())
Site
A    [1733, 150, 70, 20, 4]
B            [1500, 55, 22]
C                     [180]
Name: EnteroCount, dtype: object

Summary

df['swim'] = np.where(
    (df.groupby('Site').EnteroCount.transform(max) > 110) |
    (df.groupby('Site').EnteroCount.transform(lambda group: gmean(group.nlargest(5))) > 30), 
    'unacceptable', 'acceptable')
like image 140
Alexander Avatar answered Mar 30 '23 00:03

Alexander