I have this dataframe:
df = pd.DataFrame({'value':[1,2,3,4,2,42,12,21,21,424,34,12,42],
'type':['big','small','medium','big','big','big','big','medium','small','small','small','medium','small'],
'entity':['R','R','R','P','R','P','P','P','R','R','P','R','R']})
value type entity
0 1 big R
1 2 small R
2 3 medium R
3 4 big P
4 2 big R
5 42 big P
6 12 big P
7 21 medium P
8 21 small R
9 424 small R
10 34 small P
11 12 medium R
12 42 small R
The operation consists of grouping by column 'entity' doing a count operation based on a two logical conditions applied to a column 'value' and column 'type'. In my case, I have to count the values greater than 3 in the column 'name' and are not equal to 'medium' in the column 'type'. The result must be R=3 and P=4. After this, I must add the result to the original dataframe creating a new column named ‘Count’. I know this operation can be done in R with the next code:
df[y!='medium' & value>3 , new_var:=.N,by=entity]
df[is.na(new_var),new_var:=0,]
df[,new_var:=max(new_var),by=entity]
In a previous task, I had to calculate only the values greater than 3 as condition. In that case, the result was R=3 and P=4 and I got it applying the next code:
In []: df.groupby(['entity'])['value'].apply(lambda x: (x>3).sum())
Out[]: entity
P 5
R 4
Name: value, dtype: int64
In []: DF=pd.DataFrame(DF)
In []: DF.reset_index(inplace=True)
In []: df.merge(DF,on=['entity'],how='inner')
In []: df=df.rename(columns={'value_x':'value','value_y':'count'},inplace=True)
Out[]:
value type entity count
0 1 big R 4
1 2 small R 4
2 3 medium R 4
3 2 big R 4
4 21 small R 4
5 424 small R 4
6 12 medium R 4
7 42 small R 4
8 4 big P 5
9 42 big P 5
10 12 big P 5
11 21 medium P 5
12 34 small P 5
My questions are: How do I do it for the two conditions case? In fact, How do I do it for a general case with multiples different conditions?
How to groupby multiple columns in pandas DataFrame and compute multiple aggregations? groupby() can take the list of columns to group by multiple columns and use the aggregate functions to apply single or multiple aggregations at the same time.
groupby() function is used to split the data into groups based on some criteria. pandas objects can be split on any of their axes. The abstract definition of grouping is to provide a mapping of labels to group names. sort : Sort group keys.
Create mask by your conditions - here for greater by Series.gt
with not equal by Series.ne
chained by &
for bitwise AND
and then use GroupBy.transform
for count True
s by sum
:
mask = df['value'].gt(3) & df['type'].ne('medium')
df['count'] = mask.groupby(df['entity']).transform('sum')
Solution with helper column new
:
mask = df['value'].gt(3) & df['type'].ne('medium')
df['count'] = df.assign(new = mask).groupby('entity')['new'].transform('sum')
print (df)
value type entity count
0 1 big R 3
1 2 small R 3
2 3 medium R 3
3 4 big P 4
4 2 big R 3
5 42 big P 4
6 12 big P 4
7 21 medium P 4
8 21 small R 3
9 424 small R 3
10 34 small P 4
11 12 medium R 3
12 42 small R 3
The solution in Pandas is superb. This is an alternative in a different package. The reason I am throwing this in here is because the original code was in data.table
in R, and it might be useful for others, who probably want a similar solution within Python.
This is a solution in pydatatable, a library that aims to replicate data.table
in python. Note that it is not as feature rich as Pandas; hopefully, with time, more features will be added.
Create the frame with datatable
:
from datatable import dt, f, by, update
df = dt.Frame({'value':[1,2,3,4,2,42,12,21,21,424,34,12,42],
'type':['big','small','medium','big','big','big','big','medium','small','small','small','medium','small'],
'entity':['R','R','R','P','R','P','P','P','R','R','P','R','R']})
Create the condition - In datatable, the f
symbol is a shortcut to refer to the dataframe:
condition = (f.type!="medium") & (f.value>3)
The syntax below should be familiar to users of data.table
,
DT[i, j, by]
where i
refers to anything that can occur in the rows, j
refers to column operations, and by
is for grouping operations. The update function is similar in function to the :=
function in data.table
; it allows for creation of new columns or update of existing columns in place.
df[:, update(count=dt.sum(condition)), by('entity')]
df
value type entity count
0 1 big R 3
1 2 small R 3
2 3 medium R 3
3 4 big P 4
4 2 big R 3
5 42 big P 4
6 12 big P 4
7 21 medium P 4
8 21 small R 3
9 424 small R 3
10 34 small P 4
11 12 medium R 3
12 42 small R 3
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With