pandas value_counts with bins applied to a groupby produces incorrect results

Tags: python, pandas

I can't see why value_counts is giving me the wrong answer. Here is a small example:

In [81]: d=pd.DataFrame([[0,0],[1,100],[0,100],[2,0],[3,100],[4,100],[4,100],[4,100],[1,100],[3,100]],columns=['key','score'])

In [82]: d
Out[82]:
   key  score
0    0      0
1    1    100
2    0    100
3    2      0
4    3    100
5    4    100
6    4    100
7    4    100
8    1    100
9    3    100

In [83]: g=d.groupby('key')['score']
In [84]: g.value_counts(bins=[0, 20, 40, 60, 80, 100])
Out[84]:
key  score
0    (-0.001, 20.0]    1
     (20.0, 40.0]      1
     (40.0, 60.0]      0
     (60.0, 80.0]      0
     (80.0, 100.0]     0
1    (20.0, 40.0]      2
     (-0.001, 20.0]    0
     (40.0, 60.0]      0
     (60.0, 80.0]      0
     (80.0, 100.0]     0
2    (-0.001, 20.0]    1
     (20.0, 40.0]      0
     (40.0, 60.0]      0
     (60.0, 80.0]      0
     (80.0, 100.0]     0
3    (20.0, 40.0]      2
     (-0.001, 20.0]    0
     (40.0, 60.0]      0
     (60.0, 80.0]      0
     (80.0, 100.0]     0
4    (20.0, 40.0]      3
     (-0.001, 20.0]    0
     (40.0, 60.0]      0
     (60.0, 80.0]      0
     (80.0, 100.0]     0
Name: score, dtype: int64

The only values that occur in these data are 0 and 100. But value_counts tells me the range (20.0,40.0] has the most values and (80.0,100.0] has none.
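As a sanity check, here is a minimal sketch that bins the same scores without the groupby, using plain `Series.value_counts`. The variable names `scores` and `overall` are only for illustration; `sort=False` is passed so that the rows come back in bin order rather than by count.

```python
import pandas as pd

# Same ten scores as in the frame above, without the key column
scores = pd.Series([0, 100, 100, 0, 100, 100, 100, 100, 100, 100], name="score")
bins = [0, 20, 40, 60, 80, 100]

# Ungrouped binning; sort=False keeps the bins in ascending order
overall = scores.value_counts(bins=bins, sort=False)
print(overall)
```

Here the two zeros land in the first bin and the eight 100s in the last, which is what I expected the grouped version to show per key as well.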

Of course my real data has more values, different keys, etc. but this illustrates the problem I am seeing.

Why?


asked Mar 05 '20 20:03

GaryBishop

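1 Answer

Here is another way to do it that keeps each bin label aligned with its count: apply pd.Series.value_counts to each group instead of calling value_counts on the groupby object.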

d.groupby('key')['score'].apply(pd.Series.value_counts, bins=[0,20,40,60,80,100])

Output:

key                
0    (80.0, 100.0]     1
     (-0.001, 20.0]    1
     (60.0, 80.0]      0
     (40.0, 60.0]      0
     (20.0, 40.0]      0
1    (80.0, 100.0]     2
     (60.0, 80.0]      0
     (40.0, 60.0]      0
     (20.0, 40.0]      0
     (-0.001, 20.0]    0
2    (-0.001, 20.0]    1
     (80.0, 100.0]     0
     (60.0, 80.0]      0
     (40.0, 60.0]      0
     (20.0, 40.0]      0
3    (80.0, 100.0]     2
     (60.0, 80.0]      0
     (40.0, 60.0]      0
     (20.0, 40.0]      0
     (-0.001, 20.0]    0
4    (80.0, 100.0]     3
     (60.0, 80.0]      0
     (40.0, 60.0]      0
     (20.0, 40.0]      0
     (-0.001, 20.0]    0
Name: score, dtype: int64
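
To cross-check these counts, a minimal sketch of a different route: bin explicitly with `pd.cut`, then count rows per (key, bin) pair. This is not the method above, just an alternative assumed to give the same numbers. The column name `bin` and the variables `counts` and `table` are made up for illustration, and `include_lowest=True` is used so that a score of 0 falls into the first bin, matching the `(-0.001, 20.0]` label in the output.

```python
import pandas as pd

d = pd.DataFrame(
    [[0, 0], [1, 100], [0, 100], [2, 0], [3, 100],
     [4, 100], [4, 100], [4, 100], [1, 100], [3, 100]],
    columns=["key", "score"],
)
bins = [0, 20, 40, 60, 80, 100]

# Assign each score to an interval; the result is a categorical column
binned = d.assign(bin=pd.cut(d["score"], bins=bins, include_lowest=True))

# observed=False keeps bins with no rows, so empty bins show up as 0
counts = binned.groupby(["key", "bin"], observed=False).size()

# One row per key, one column per bin, for easier reading
table = counts.unstack("bin")
print(table)
```

In the wide table, the last column should hold the counts for `(80.0, 100.0]` and the first column the counts for `(-0.001, 20.0]`, which makes it easy to compare against the per-group output above.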


answered Sep 28 '22 04:09

Scott Boston
