Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Groupby based on a multiple logical conditions applied to a different columns DataFrame

Tags:

python

pandas

I have this dataframe:

df = pd.DataFrame({'value':[1,2,3,4,2,42,12,21,21,424,34,12,42],
'type':['big','small','medium','big','big','big','big','medium','small','small','small','medium','small'],
'entity':['R','R','R','P','R','P','P','P','R','R','P','R','R']})

    value    type  entity
0       1     big       R
1       2   small       R
2       3  medium       R
3       4     big       P
4       2     big       R
5      42     big       P
6      12     big       P
7      21  medium       P
8      21   small       R
9     424   small       R
10     34   small       P
11     12  medium       R
12     42   small       R

The operation consists of grouping by column 'entity' doing a count operation based on a two logical conditions applied to a column 'value' and column 'type'. In my case, I have to count the values greater than 3 in the column 'name' and are not equal to 'medium' in the column 'type'. The result must be R=3 and P=4. After this, I must add the result to the original dataframe creating a new column named ‘Count’. I know this operation can be done in R with the next code:

df[y!='medium' & value>3 , new_var:=.N,by=entity]
df[is.na(new_var),new_var:=0,]
df[,new_var:=max(new_var),by=entity]

In a previous task, I had to calculate only the values greater than 3 as condition. In that case, the result was R=3 and P=4 and I got it applying the next code:

In []:  df.groupby(['entity'])['value'].apply(lambda x: (x>3).sum())

Out[]:  entity
        P    5
        R    4
        Name: value, dtype: int64

In []:  DF=pd.DataFrame(DF)
In []:  DF.reset_index(inplace=True)
In []:  df.merge(DF,on=['entity'],how='inner')
In []:  df=df.rename(columns={'value_x':'value','value_y':'count'},inplace=True)
Out[]:  

    value   type     entity  count
0      1     big          R      4
1      2   small          R      4
2      3  medium          R      4
3      2     big          R      4
4     21   small          R      4
5    424   small          R      4
6     12  medium          R      4
7     42   small          R      4
8      4     big          P      5
9     42     big          P      5
10    12     big          P      5
11    21  medium          P      5
12    34   small          P      5

My questions are: How do I do it for the two conditions case? In fact, How do I do it for a general case with multiples different conditions?

like image 466
JASoto Avatar asked Oct 06 '20 06:10

JASoto


People also ask

Can you use Groupby with multiple columns in pandas?

How to groupby multiple columns in pandas DataFrame and compute multiple aggregations? groupby() can take the list of columns to group by multiple columns and use the aggregate functions to apply single or multiple aggregations at the same time.

What is possible using Groupby () method of pandas?

groupby() function is used to split the data into groups based on some criteria. pandas objects can be split on any of their axes. The abstract definition of grouping is to provide a mapping of labels to group names. sort : Sort group keys.


Video Answer


2 Answers

Create mask by your conditions - here for greater by Series.gt with not equal by Series.ne chained by & for bitwise AND and then use GroupBy.transform for count Trues by sum:

mask = df['value'].gt(3) & df['type'].ne('medium')
df['count'] = mask.groupby(df['entity']).transform('sum')

Solution with helper column new:

mask = df['value'].gt(3) & df['type'].ne('medium')
df['count'] = df.assign(new = mask).groupby('entity')['new'].transform('sum')

print (df)
    value    type entity  count
0       1     big      R      3
1       2   small      R      3
2       3  medium      R      3
3       4     big      P      4
4       2     big      R      3
5      42     big      P      4
6      12     big      P      4
7      21  medium      P      4
8      21   small      R      3
9     424   small      R      3
10     34   small      P      4
11     12  medium      R      3
12     42   small      R      3
like image 197
jezrael Avatar answered Oct 22 '22 02:10

jezrael


The solution in Pandas is superb. This is an alternative in a different package. The reason I am throwing this in here is because the original code was in data.table in R, and it might be useful for others, who probably want a similar solution within Python.

This is a solution in pydatatable, a library that aims to replicate data.table in python. Note that it is not as feature rich as Pandas; hopefully, with time, more features will be added.

Create the frame with datatable :

   from datatable import dt, f, by, update

    df = dt.Frame({'value':[1,2,3,4,2,42,12,21,21,424,34,12,42],
'type':['big','small','medium','big','big','big','big','medium','small','small','small','medium','small'],
'entity':['R','R','R','P','R','P','P','P','R','R','P','R','R']})

Create the condition - In datatable, the f symbol is a shortcut to refer to the dataframe:

condition = (f.type!="medium") & (f.value>3)

The syntax below should be familiar to users of data.table,

 DT[i, j, by] 

where i refers to anything that can occur in the rows, j refers to column operations, and by is for grouping operations. The update function is similar in function to the := function in data.table; it allows for creation of new columns or update of existing columns in place.

df[:, update(count=dt.sum(condition)), by('entity')]

df

 value  type    entity  count
0   1   big     R       3
1   2   small   R       3
2   3   medium  R       3
3   4   big     P       4
4   2   big     R       3
5   42  big     P       4
6   12  big     P       4
7   21  medium  P       4
8   21  small   R       3
9   424 small   R       3
10  34  small   P       4
11  12  medium  R       3
12  42  small   R       3
like image 34
sammywemmy Avatar answered Oct 22 '22 00:10

sammywemmy