I have this dataframe: <pre class="prettyprint"><code>df = pd.DataFrame({'value':[1,2,3,4,2,42,12,21,21,424,34,12,42], 'type':['big','small','medium','big','big','big','big','medium','small','small','small','medium','small'], 'entity':['R','R','R','P','R','P','P','P','R','R','P','R','R']}) value type entity 0 1 big R 1 2 small R 2 3 medium R 3 4 big P 4 2 big R 5 42 big P 6 12 big P 7 21 medium P 8 21 small R 9 424 small R 10 34 small P 11 12 medium R 12 42 small R </code></pre> The operation consists of grouping by column 'entity' doing a count operation based on a two logical conditions applied to a column 'value' and column 'type'. In my case, I have to count the values greater than 3 in the column 'name' and are not equal to 'medium' in the column 'type'. The result must be R=3 and P=4. After this, I must add the result to the original dataframe creating a new column named ‘Count’. I know this operation can be done in R with the next code: <pre class="prettyprint"><code>df[y!='medium' & value>3 , new_var:=.N,by=entity] df[is.na(new_var),new_var:=0,] df[,new_var:=max(new_var),by=entity] </code></pre> In a previous task, I had to calculate only the values greater than 3 as condition. In that case, the result was R=3 and P=4 and I got it applying the next code: <pre class="prettyprint"><code>In []: df.groupby(['entity'])['value'].apply(lambda x: (x>3).sum()) Out[]: entity P 5 R 4 Name: value, dtype: int64 In []: DF=pd.DataFrame(DF) In []: DF.reset_index(inplace=True) In []: df.merge(DF,on=['entity'],how='inner') In []: df=df.rename(columns={'value_x':'value','value_y':'count'},inplace=True) Out[]: value type entity count 0 1 big R 4 1 2 small R 4 2 3 medium R 4 3 2 big R 4 4 21 small R 4 5 424 small R 4 6 12 medium R 4 7 42 small R 4 8 4 big P 5 9 42 big P 5 10 12 big P 5 11 21 medium P 5 12 34 small P 5 </code></pre> My questions are: How do I do it for the two conditions case? In fact, How do I do it for a general case with multiples different conditions?

Create mask by your conditions - here for greater by <code>Series.gt</code> with not equal by <code>Series.ne</code> chained by <code>&</code> for bitwise <code>AND</code> and then use <code>GroupBy.transform</code> for count <code>True</code>s by <code>sum</code>: <pre class="prettyprint"><code>mask = df['value'].gt(3) & df['type'].ne('medium') df['count'] = mask.groupby(df['entity']).transform('sum') </code></pre> Solution with helper column <code>new</code>: <pre class="prettyprint"><code>mask = df['value'].gt(3) & df['type'].ne('medium') df['count'] = df.assign(new = mask).groupby('entity')['new'].transform('sum') </code></pre> <hr> <pre class="prettyprint"><code>print (df) value type entity count 0 1 big R 3 1 2 small R 3 2 3 medium R 3 3 4 big P 4 4 2 big R 3 5 42 big P 4 6 12 big P 4 7 21 medium P 4 8 21 small R 3 9 424 small R 3 10 34 small P 4 11 12 medium R 3 12 42 small R 3 </code></pre>

The solution in Pandas is superb. This is an alternative in a different package. The reason I am throwing this in here is because the original code was in <code>data.table</code> in R, and it might be useful for others, who probably want a similar solution within Python. This is a solution in pydatatable, a library that aims to replicate <code>data.table</code> in python. Note that it is not as feature rich as Pandas; hopefully, with time, more features will be added. Create the frame with <code>datatable</code> : <pre class="prettyprint"><code> from datatable import dt, f, by, update df = dt.Frame({'value':[1,2,3,4,2,42,12,21,21,424,34,12,42], 'type':['big','small','medium','big','big','big','big','medium','small','small','small','medium','small'], 'entity':['R','R','R','P','R','P','P','P','R','R','P','R','R']}) </code></pre> Create the condition - In datatable, the <code>f</code> symbol is a shortcut to refer to the dataframe: <pre class="prettyprint"><code>condition = (f.type!="medium") & (f.value>3) </code></pre> The syntax below should be familiar to users of <code>data.table</code>, <pre class="prettyprint"><code> DT[i, j, by] </code></pre> where <code>i</code> refers to anything that can occur in the rows, <code>j</code> refers to column operations, and <code>by</code> is for grouping operations. The update function is similar in function to the <code>:=</code> function in <code>data.table</code>; it allows for creation of new columns or update of existing columns in place. <pre class="prettyprint"><code>df[:, update(count=dt.sum(condition)), by('entity')] df value type entity count 0 1 big R 3 1 2 small R 3 2 3 medium R 3 3 4 big P 4 4 2 big R 3 5 42 big P 4 6 12 big P 4 7 21 medium P 4 8 21 small R 3 9 424 small R 3 10 34 small P 4 11 12 medium R 3 12 42 small R 3 </code></pre>

Groupby based on a multiple logical conditions applied to a different columns DataFrame

Tags:

python

pandas

I have this dataframe:

df = pd.DataFrame({'value':[1,2,3,4,2,42,12,21,21,424,34,12,42],
'type':['big','small','medium','big','big','big','big','medium','small','small','small','medium','small'],
'entity':['R','R','R','P','R','P','P','P','R','R','P','R','R']})

    value    type  entity
0       1     big       R
1       2   small       R
2       3  medium       R
3       4     big       P
4       2     big       R
5      42     big       P
6      12     big       P
7      21  medium       P
8      21   small       R
9     424   small       R
10     34   small       P
11     12  medium       R
12     42   small       R

The operation consists of grouping by column 'entity' doing a count operation based on a two logical conditions applied to a column 'value' and column 'type'. In my case, I have to count the values greater than 3 in the column 'name' and are not equal to 'medium' in the column 'type'. The result must be R=3 and P=4. After this, I must add the result to the original dataframe creating a new column named ‘Count’. I know this operation can be done in R with the next code:

df[y!='medium' & value>3 , new_var:=.N,by=entity]
df[is.na(new_var),new_var:=0,]
df[,new_var:=max(new_var),by=entity]

In a previous task, I had to calculate only the values greater than 3 as condition. In that case, the result was R=3 and P=4 and I got it applying the next code:

In []:  df.groupby(['entity'])['value'].apply(lambda x: (x>3).sum())

Out[]:  entity
        P    5
        R    4
        Name: value, dtype: int64

In []:  DF=pd.DataFrame(DF)
In []:  DF.reset_index(inplace=True)
In []:  df.merge(DF,on=['entity'],how='inner')
In []:  df=df.rename(columns={'value_x':'value','value_y':'count'},inplace=True)
Out[]:  

    value   type     entity  count
0      1     big          R      4
1      2   small          R      4
2      3  medium          R      4
3      2     big          R      4
4     21   small          R      4
5    424   small          R      4
6     12  medium          R      4
7     42   small          R      4
8      4     big          P      5
9     42     big          P      5
10    12     big          P      5
11    21  medium          P      5
12    34   small          P      5

My questions are: How do I do it for the two conditions case? In fact, How do I do it for a general case with multiples different conditions?

466

asked Oct 06 '20 06:10

JASoto

Video Answer

2 Answers

Create mask by your conditions - here for greater by Series.gt with not equal by Series.ne chained by & for bitwise AND and then use GroupBy.transform for count Trues by sum:

mask = df['value'].gt(3) & df['type'].ne('medium')
df['count'] = mask.groupby(df['entity']).transform('sum')

Solution with helper column new:

mask = df['value'].gt(3) & df['type'].ne('medium')
df['count'] = df.assign(new = mask).groupby('entity')['new'].transform('sum')

print (df)
    value    type entity  count
0       1     big      R      3
1       2   small      R      3
2       3  medium      R      3
3       4     big      P      4
4       2     big      R      3
5      42     big      P      4
6      12     big      P      4
7      21  medium      P      4
8      21   small      R      3
9     424   small      R      3
10     34   small      P      4
11     12  medium      R      3
12     42   small      R      3

197

answered Oct 22 '22 02:10

jezrael

The solution in Pandas is superb. This is an alternative in a different package. The reason I am throwing this in here is because the original code was in data.table in R, and it might be useful for others, who probably want a similar solution within Python.

This is a solution in pydatatable, a library that aims to replicate data.table in python. Note that it is not as feature rich as Pandas; hopefully, with time, more features will be added.

Create the frame with datatable :

   from datatable import dt, f, by, update

    df = dt.Frame({'value':[1,2,3,4,2,42,12,21,21,424,34,12,42],
'type':['big','small','medium','big','big','big','big','medium','small','small','small','medium','small'],
'entity':['R','R','R','P','R','P','P','P','R','R','P','R','R']})

Create the condition - In datatable, the f symbol is a shortcut to refer to the dataframe:

condition = (f.type!="medium") & (f.value>3)

The syntax below should be familiar to users of data.table,

 DT[i, j, by]

where i refers to anything that can occur in the rows, j refers to column operations, and by is for grouping operations. The update function is similar in function to the := function in data.table; it allows for creation of new columns or update of existing columns in place.

df[:, update(count=dt.sum(condition)), by('entity')]

df

 value  type    entity  count
0   1   big     R       3
1   2   small   R       3
2   3   medium  R       3
3   4   big     P       4
4   2   big     R       3
5   42  big     P       4
6   12  big     P       4
7   21  medium  P       4
8   21  small   R       3
9   424 small   R       3
10  34  small   P       4
11  12  medium  R       3
12  42  small   R       3

answered Oct 22 '22 00:10

sammywemmy

Related questions
                            
                                Avoid global variables for unpicklable shared state among multiprocessing.Pool workers
                            
                                sqlalchemy.pool + psycopg2 timeout issue
                            
                                How to store and use HTML templates in serverless application on AWS Lambda (using AWS SAM)?
                            
                                Write-streaming to Google Cloud Storage in Python
                            
                                airflow dag failed... but all tasks succeeded
                            
                                Pandas dataframe type datetime64[ns] is not working in Hive/Athena
                            
                                OpenCV - How to get real world distance from a 2D image using a chessboard as reference
                            
                                How to make a progress bar on a web page for pandas operation
                            
                                Python Asyncio Task Cancellation
                            
                                How to subset row of condition with some of N rows before the condition meet , more faster than my code?
                            
                                How to test the pytest fixture itself?
                            
                                Imports break VSCode testing with pytest
                            
                                Why in Keras subclassing API, the call method is never called and as an alternative the input is passed by calling the object of this class?
                            
                                vscode "no refactorings available" for python
                            
                                How to run tensorflow inference for multiple models on GPU in parallel?
                            
                                Pycharm Can't retrieve image ID from build stream
                            
                                Google Cloud Functions Deploy "allow unauthenticated invocations..."
                            
                                pydantic: Using property.getter decorator for a field with an alias
                            
                                How to do persistent database connection in FastAPI?
                            
                                How to type-hint a matplotlib.axes._subplots.AxesSubplots object in python3

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With