I can't see why value_counts is giving me the wrong answer. Here is a small example:
In [81]: d=pd.DataFrame([[0,0],[1,100],[0,100],[2,0],[3,100],[4,100],[4,100],[4,100],[1,100],[3,100]],columns=['key','score'])
In [82]: d
Out[82]:
   key  score
0    0      0
1    1    100
2    0    100
3    2      0
4    3    100
5    4    100
6    4    100
7    4    100
8    1    100
9    3    100
In [83]: g=d.groupby('key')['score']
In [84]: g.value_counts(bins=[0, 20, 40, 60, 80, 100])
Out[84]:
key  score
0    (-0.001, 20.0]    1
     (20.0, 40.0]      1
     (40.0, 60.0]      0
     (60.0, 80.0]      0
     (80.0, 100.0]     0
1    (20.0, 40.0]      2
     (-0.001, 20.0]    0
     (40.0, 60.0]      0
     (60.0, 80.0]      0
     (80.0, 100.0]     0
2    (-0.001, 20.0]    1
     (20.0, 40.0]      0
     (40.0, 60.0]      0
     (60.0, 80.0]      0
     (80.0, 100.0]     0
3    (20.0, 40.0]      2
     (-0.001, 20.0]    0
     (40.0, 60.0]      0
     (60.0, 80.0]      0
     (80.0, 100.0]     0
4    (20.0, 40.0]      3
     (-0.001, 20.0]    0
     (40.0, 60.0]      0
     (60.0, 80.0]      0
     (80.0, 100.0]     0
Name: score, dtype: int64
The only values that occur in these data are 0 and 100. But value_counts tells me the range (20.0,40.0] has the most values and (80.0,100.0] has none.
Of course my real data has more values, different keys, etc. but this illustrates the problem I am seeing.
Why?
Return a Series containing counts of unique values. The resulting object will be in descending order so that the first element is the most frequently-occurring element.
By default, value_counts() excludes null (NaN) values from the result. Setting the dropna parameter to False includes them in the counts. Since this dataset has no null values, setting dropna makes no difference here.
Use count() when you want the number of valid (non-null) values in each column, optionally per group. Use value_counts() when you want the frequency of each distinct value in a Series.
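As a small illustration of the difference between count() and value_counts(), and of the dropna parameter, here is a minimal sketch. The Series s and its values are made up for this example and are not part of the data in the question.

```python
import numpy as np
import pandas as pd

# Hypothetical Series with one missing value (not the question's data)
s = pd.Series([100, 0, 100, np.nan], name="score")

# count() returns the number of non-null entries
n_valid = s.count()

# value_counts() returns the frequency of each distinct value,
# most frequent first, with NaN excluded by default
freq = s.value_counts()

# dropna=False adds a row for the missing values
freq_with_nan = s.value_counts(dropna=False)

print(n_valid)
print(freq)
print(freq_with_nan)
```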
Here is another way of doing it to keep the integrity of the indexes.
d.groupby('key')['score'].apply(pd.Series.value_counts, bins=[0,20,40,60,80,100])
Output:
key
0    (80.0, 100.0]     1
     (-0.001, 20.0]    1
     (60.0, 80.0]      0
     (40.0, 60.0]      0
     (20.0, 40.0]      0
1    (80.0, 100.0]     2
     (60.0, 80.0]      0
     (40.0, 60.0]      0
     (20.0, 40.0]      0
     (-0.001, 20.0]    0
2    (-0.001, 20.0]    1
     (80.0, 100.0]     0
     (60.0, 80.0]      0
     (40.0, 60.0]      0
     (20.0, 40.0]      0
3    (80.0, 100.0]     2
     (60.0, 80.0]      0
     (40.0, 60.0]      0
     (20.0, 40.0]      0
     (-0.001, 20.0]    0
4    (80.0, 100.0]     3
     (60.0, 80.0]      0
     (40.0, 60.0]      0
     (20.0, 40.0]      0
     (-0.001, 20.0]    0
Name: score, dtype: int64
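If you would rather not rely on how the grouped value_counts lines counts up with bin labels, a possible alternative is to bin the scores yourself with pd.cut and then count per key. This is only a sketch, not the method used above: the label strings in labels and the name "bin" are my own choices for illustration, and include_lowest=True is set so that a score of 0 falls into the first bin.

```python
import pandas as pd

d = pd.DataFrame(
    [[0, 0], [1, 100], [0, 100], [2, 0], [3, 100],
     [4, 100], [4, 100], [4, 100], [1, 100], [3, 100]],
    columns=["key", "score"],
)

bins = [0, 20, 40, 60, 80, 100]
# Hypothetical, human-readable names for the five bins
labels = ["0-20", "20-40", "40-60", "60-80", "80-100"]

# Assign every row to a bin first, so each count is tied to its own label
binned = (
    pd.cut(d["score"], bins=bins, labels=labels, include_lowest=True)
    .astype(str)
    .rename("bin")
)

# Count rows per (key, bin), then add zero-filled columns for the empty bins
counts = pd.crosstab(d["key"], binned).reindex(columns=labels, fill_value=0)

print(counts)
```

The result is a table with one row per key and one column per bin, which makes it easy to check by eye that all of the 100s end up in the last bin.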