
Pandas: Group by a column that meets a condition

I have a data set with three columns: rating, breed, and dog.

import pandas as pd
dogs = {'breed': ['Chihuahua', 'Chihuahua', 'Dalmatian', 'Sphynx'],
        'dog': [True, True, True, False],
        'rating': [8.0, 9.0, 10.0, 7.0]}

df = pd.DataFrame(data=dogs)

I would like to calculate the mean rating per breed, but only for rows where dog is True. This is the expected output:

       breed  rating
0  Chihuahua     8.5
1  Dalmatian    10.0

This has been my attempt:

df.groupby('breed')['rating'].mean().where(dog == True)

And this is the error that I get:

NameError: name 'dog' is not defined

Whenever I try to add the where condition I only get errors. Can anyone advise a solution? TIA

seisgradox asked Jun 03 '18 01:06


2 Answers

Once you group by breed and select the rating column, the dog column no longer exists in the result. Even if it did, you are not accessing it correctly: a bare dog is an undefined Python name, not a reference to the column, which is why you get the NameError.
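To make this concrete, here is a small sketch that rebuilds the DataFrame from the question and inspects what the grouped result contains. The names means and error_message are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'breed': ['Chihuahua', 'Chihuahua', 'Dalmatian', 'Sphynx'],
    'dog': [True, True, True, False],
    'rating': [8.0, 9.0, 10.0, 7.0],
})

# Grouping by 'breed' and selecting 'rating' gives a Series of means
# indexed by breed; the 'dog' column is not carried along.
means = df.groupby('breed')['rating'].mean()
print(type(means).__name__)
print(means.index.tolist())

# A bare `dog` is looked up as a Python variable, not as a column,
# so the expression fails before .where() is even called.
error_message = None
try:
    means.where(dog == True)
except NameError as exc:
    error_message = str(exc)
print(error_message)
```

The Sphynx row is still present in means, because nothing has filtered on dog at this point.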

Filter your DataFrame first, then use groupby with mean:

df[df.dog].groupby('breed')['rating'].mean().reset_index()

       breed  rating
0  Chihuahua     8.5
1  Dalmatian    10.0
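As a possible variation on the line above, the same filter-then-group idea can be written with as_index=False (so reset_index is not needed) or with DataFrame.query. This is only a sketch, and by_mask and by_query are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    'breed': ['Chihuahua', 'Chihuahua', 'Dalmatian', 'Sphynx'],
    'dog': [True, True, True, False],
    'rating': [8.0, 9.0, 10.0, 7.0],
})

# Boolean-mask filter first, then group. as_index=False keeps 'breed'
# as a regular column instead of moving it into the index.
by_mask = df[df['dog']].groupby('breed', as_index=False)['rating'].mean()

# The same filter expressed with DataFrame.query.
by_query = df.query('dog == True').groupby('breed', as_index=False)['rating'].mean()

print(by_mask)
```

Both forms drop the non-dog rows before aggregating, so Sphynx never reaches the groupby.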
user3483203 answered Oct 16 '22 23:10


An alternative solution is to make dog one of your grouper keys, then filter by dog in a separate step. This is more efficient if you do not want to lose the aggregated data for non-dogs, because one groupby produces the means for both groups.

res = df.groupby(['dog', 'breed'])['rating'].mean().reset_index()

print(res)

     dog      breed  rating
0  False     Sphynx     7.0
1   True  Chihuahua     8.5
2   True  Dalmatian    10.0

print(res[res['dog']])

    dog      breed  rating
1  True  Chihuahua     8.5
2  True  Dalmatian    10.0
jpp answered Oct 16 '22 22:10