I've heard that in Pandas there are often multiple ways to do the same thing, but I was wondering:
If I'm trying to group data by a value within a specific column and count the number of items with that value, when does it make sense to use df.groupby('colA').count() and when does it make sense to use df['colA'].value_counts()?
The DataFrame count() method counts the non-null values in each column or in each row. Counts can also be taken for each level of a multi-index DataFrame. The value_counts() method finds the frequency of each unique value in a Series.
The value_counts() method returns a Series containing the counts of unique values. This means that, for any column in a DataFrame, it returns how many times each unique entry occurs in that column.
Use count() by column name: use pandas DataFrame.groupby() to group the rows by a column, then use the count() method to get the count for each group, ignoring None and NaN values. It works with non-floating-point data as well.
Use the Series.value_counts() function to find the count of each unique value in the given Series object.
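As a minimal sketch of the methods described above, the snippet below runs count() along both axes and value_counts() on one column. The frame `toy` and its columns `x` and `y` are made up for illustration and are not part of the original question.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, not from the original post.
toy = pd.DataFrame({
    "x": [1.0, np.nan, 3.0],
    "y": ["a", "a", None],
})

# DataFrame.count() counts non-null cells per column (axis=0 is the default)...
per_column = toy.count()
# ...or per row when axis=1 is passed.
per_row = toy.count(axis=1)

# Series.value_counts() tallies how often each distinct value occurs;
# missing values are left out by default.
y_freq = toy["y"].value_counts()

print(per_column)
print(per_row)
print(y_freq)
```

Note that count() answers "how many cells are filled in", while value_counts() answers "how often does each value appear".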
There is a difference. value_counts returns:
The resulting object will be in descending order so that the first element is the most frequently-occurring element.
count does not; it sorts the output by the index (created from the column passed to groupby('col')).
df.groupby('colA').count() aggregates all columns of df with the count function, so it counts values excluding NaNs.
So if you need to count only one column, use:
df.groupby('colA')['colA'].count()
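Sample:
import numpy as np
import pandas as pd

# colA is listed first so the column order matches the printed output
df = pd.DataFrame({'colA':['c','c','b','a',np.nan,'b','b'],
                   'colB':list('abcdefg'),
                   'colC':[1,3,5,7,np.nan,np.nan,4],
                   'colD':[np.nan,3,6,9,2,4,np.nan]})
print(df)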
  colA colB  colC  colD
0    c    a   1.0   NaN
1    c    b   3.0   3.0
2    b    c   5.0   6.0
3    a    d   7.0   9.0
4  NaN    e   NaN   2.0
5    b    f   NaN   4.0
6    b    g   4.0   NaN
print(df['colA'].value_counts())
b    3
c    2
a    1
Name: colA, dtype: int64
print(df.groupby('colA').count())
      colB  colC  colD
colA
a        1     1     1
b        3     2     2
c        2     2     1
print(df.groupby('colA')['colA'].count())
colA
a    1
b    3
c    2
Name: colA, dtype: int64
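To make the ordering difference concrete, here is a small sketch that reuses only the colA values from the sample and checks that the two results hold the same numbers once value_counts is sorted by its index. The variable names `vc`, `gc` and `same` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"colA": ["c", "c", "b", "a", np.nan, "b", "b"]})

vc = df["colA"].value_counts()            # ordered by count, most frequent first
gc = df.groupby("colA")["colA"].count()   # ordered by group key

# Both skip the NaN row by default; only the ordering differs.
same = vc.sort_index().tolist() == gc.tolist()
print(vc.index.tolist())
print(gc.index.tolist())
print(same)
```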
groupby and value_counts are different tools. value_counts is primarily a Series method: it works on a single column or Series, and its sole purpose is to return a Series of value frequencies. (Since pandas 1.1 there is also DataFrame.value_counts, which counts unique rows rather than the values of one column.)
groupby returns a GroupBy object, so one can perform statistical computations over it. So when you do df.groupby(col).count(), it returns the number of non-null values present in each of the other columns for each group of col.
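As a rough illustration of "performing statistical computations" on the grouped object, the sketch below computes several statistics per group in one call. The small color/size frame and the names `grouped` and `stats` are assumptions chosen for the example.

```python
import pandas as pd

# Small illustrative frame (assumed data).
df = pd.DataFrame({
    "color": ["r", "r", "b", "b", "g", "g", "r"],
    "size": [1, 2, 1, 2, 1, 3, 4],
})

grouped = df.groupby("color")   # a GroupBy object; no statistics computed yet

# count is just one of many aggregations available on the grouped column.
stats = grouped["size"].agg(["count", "mean", "max"])
print(stats)
```

This is something value_counts cannot do: it only reports frequencies, whereas the grouped object can feed any aggregation.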
When should value_counts be used and when should groupby.count be used? Let's take an example:
df = pd.DataFrame({'id': [1, 2, 3, 4, 2, 2, 4], 'color': ["r","r","b","b","g","g","r"], 'size': [1,2,1,2,1,3,4]})
Groupby count:
df.groupby('color').count()
       id  size
color
b       2     2
g       2     2
r       3     3
Groupby count is generally used for getting the number of valid (non-null) values present in all the columns with respect to one or more specified columns. So NaN (not a number) values will be excluded.
To find the frequency using groupby, you need to aggregate against the specified column itself, like @jez did. (Maybe value_counts was implemented to avoid this and make developers' lives easier.)
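The NaN exclusion is easy to miss when the data happens to be complete, as in the example above. The sketch below uses the same colors but swaps one size for NaN (an assumption made purely for illustration) and compares count() with GroupBy.size(), which counts rows regardless of missing values.

```python
import numpy as np
import pandas as pd

# Same colors as the example above, with one 'size' replaced by NaN (assumed).
df = pd.DataFrame({
    "color": ["r", "r", "b", "b", "g", "g", "r"],
    "size": [1, np.nan, 1, 2, 1, 3, 4],
})

non_null = df.groupby("color")["size"].count()  # skips the NaN in group 'r'
rows = df.groupby("color").size()               # counts every row per group

print(non_null)
print(rows)
```

If the goal is the frequency of each color, size() or value_counts() is the safer choice, since count() on another column can undercount when that column has gaps.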
Value Counts:
df['color'].value_counts()
r 3
g 2
b 2
Name: color, dtype: int64
value_counts is generally used for finding the frequency of the values present in one particular column.
In conclusion:
.groupby(col).count() should be used when you want to find the number of valid (non-null) values present in the other columns with respect to the specified col.
.value_counts() should be used to find the frequencies of the values in a Series.