
pandas value_counts with bins applied to a groupby produces incorrect results

Tags: python, pandas

I can't see why value_counts is giving me the wrong answer. Here is a small example:

In [81]: d=pd.DataFrame([[0,0],[1,100],[0,100],[2,0],[3,100],[4,100],[4,100],[4,100],[1,100],[3,100]],columns=['key','score'])

In [82]: d
Out[82]:
   key  score
0    0      0
1    1    100
2    0    100
3    2      0
4    3    100
5    4    100
6    4    100
7    4    100
8    1    100
9    3    100

In [83]: g=d.groupby('key')['score']
In [84]: g.value_counts(bins=[0, 20, 40, 60, 80, 100])
Out[84]:
key  score
0    (-0.001, 20.0]    1
     (20.0, 40.0]      1
     (40.0, 60.0]      0
     (60.0, 80.0]      0
     (80.0, 100.0]     0
1    (20.0, 40.0]      2
     (-0.001, 20.0]    0
     (40.0, 60.0]      0
     (60.0, 80.0]      0
     (80.0, 100.0]     0
2    (-0.001, 20.0]    1
     (20.0, 40.0]      0
     (40.0, 60.0]      0
     (60.0, 80.0]      0
     (80.0, 100.0]     0
3    (20.0, 40.0]      2
     (-0.001, 20.0]    0
     (40.0, 60.0]      0
     (60.0, 80.0]      0
     (80.0, 100.0]     0
4    (20.0, 40.0]      3
     (-0.001, 20.0]    0
     (40.0, 60.0]      0
     (60.0, 80.0]      0
     (80.0, 100.0]     0
Name: score, dtype: int64

The only values that occur in this data are 0 and 100, yet value_counts reports that the interval (20.0, 40.0] holds the most values and that (80.0, 100.0] holds none.

My real data has more values, different keys, and so on, but this example reproduces the problem I am seeing.

Why?

GaryBishop asked Mar 05 '20 20:03




1 Answer

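Here is another way to do it that keeps the interval labels aligned with their counts: apply pd.Series.value_counts to each group instead of calling value_counts on the groupby object.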

d.groupby('key')['score'].apply(pd.Series.value_counts, bins=[0,20,40,60,80,100])

Output:

key                
0    (80.0, 100.0]     1
     (-0.001, 20.0]    1
     (60.0, 80.0]      0
     (40.0, 60.0]      0
     (20.0, 40.0]      0
1    (80.0, 100.0]     2
     (60.0, 80.0]      0
     (40.0, 60.0]      0
     (20.0, 40.0]      0
     (-0.001, 20.0]    0
2    (-0.001, 20.0]    1
     (80.0, 100.0]     0
     (60.0, 80.0]      0
     (40.0, 60.0]      0
     (20.0, 40.0]      0
3    (80.0, 100.0]     2
     (60.0, 80.0]      0
     (40.0, 60.0]      0
     (20.0, 40.0]      0
     (-0.001, 20.0]    0
4    (80.0, 100.0]     3
     (60.0, 80.0]      0
     (40.0, 60.0]      0
     (20.0, 40.0]      0
     (-0.001, 20.0]    0
Name: score, dtype: int64
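
If a table with one row per key is easier to read, a possible alternative (not part of the answer above, just a sketch) is to bin the scores with pd.cut first and then count per key. The names `bins`, `binned` and `counts` are made up for this illustration, and `observed=False` is assumed to be needed so that empty intervals still show up with a count of 0.

```python
import pandas as pd

# Same data as in the question
d = pd.DataFrame(
    [[0, 0], [1, 100], [0, 100], [2, 0], [3, 100],
     [4, 100], [4, 100], [4, 100], [1, 100], [3, 100]],
    columns=["key", "score"],
)

bins = [0, 20, 40, 60, 80, 100]

# Bin every score up front; include_lowest=True lets the value 0
# fall into the first interval instead of becoming NaN
binned = pd.cut(d["score"], bins=bins, include_lowest=True)

# Count rows per (key, interval); observed=False keeps intervals
# that never occur, and unstack turns the intervals into columns
counts = (
    d.groupby(["key", binned], observed=False)
     .size()
     .unstack(fill_value=0)
)

print(counts)
```

With this data every score of 100 should be counted in the last column and every score of 0 in the first, which matches the apply-based output above.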
Scott Boston answered Sep 28 '22 04:09