Say I have a huge list of numbers between 0 and 100. I compute the bin edges from the max number, splitting the range into 10 bins, so my ranges are, for example
ranges = [0,10,20,30,40,50,60,70,80,90,100]
Now I count the occurrences in each range: 0-10, 10-20, and so on. I iterate over every number in the list and check which range it falls into. I assume this is not the best way in terms of runtime speed.
Can I speed it up by using pandas, e.g. pandas.groupby, and how?
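For reference, here is a minimal sketch of the loop I currently use (the random data is just a stand-in for my real list, and I assume right-closed bins, so a value of exactly 0 would fall outside every bin, the same edge case pd.cut has):

import random

numbers = [random.randint(0, 100) for _ in range(100_000)]  # stand-in for my data
ranges = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

counts = [0] * (len(ranges) - 1)
for x in numbers:
    for i in range(len(ranges) - 1):
        if ranges[i] < x <= ranges[i + 1]:  # right-closed bin (lo, hi]
            counts[i] += 1
            break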
We can use pd.cut to bin the values into ranges, then we can groupby these ranges, and finally call count to count the values now binned into these ranges:
import numpy as np
import pandas as pd

np.random.seed(0)
# randint's upper bound is exclusive: 101 gives integers 1..100
# (replaces the deprecated np.random.random_integers)
df = pd.DataFrame({"a": np.random.randint(1, 101, size=100)})
ranges = [0,10,20,30,40,50,60,70,80,90,100]
df.groupby(pd.cut(df.a, ranges)).count()
a
a
(0, 10] 11
(10, 20] 10
(20, 30] 8
(30, 40] 13
(40, 50] 11
(50, 60] 9
(60, 70] 10
(70, 80] 11
(80, 90] 13
(90, 100] 4
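As an aside, the doubled a header in the output is just the counted column grouped by itself; selecting the column first gives a plain Series of counts:

df.groupby(pd.cut(df.a, ranges))["a"].count()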
Surprised I haven't seen this yet, so without further ado, here is .value_counts(bins=N). Computing bins with pd.cut followed by a groupby is a 2-step process; value_counts allows you a shortcut using the bins argument:
# Uses Ed Chum's setup. Cross check our answers match!
np.random.seed(0)
df = pd.DataFrame({"a": np.random.randint(1, 101, size=100)})  # replaces deprecated random_integers
df['a'].value_counts(bins=10, sort=False)
(0.9, 10.9] 11
(10.9, 20.8] 10
(20.8, 30.7] 8
(30.7, 40.6] 13
(40.6, 50.5] 11
(50.5, 60.4] 9
(60.4, 70.3] 10
(70.3, 80.2] 11
(80.2, 90.1] 13
(90.1, 100.0] 4
Name: a, dtype: int64
This creates 10 evenly-spaced, right-closed intervals and bin-counts your data. sort=False is required to stop value_counts from ordering the result in decreasing order of count.
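For comparison, with the default sort=True the same call returns the bins ordered by count instead (a quick sketch):

df['a'].value_counts(bins=10)  # most populous bins, e.g. the 13-count intervals, come first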
If you need specific bin edges rather than evenly-spaced ones, you can pass a list to the bins argument:
bins = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
df['a'].value_counts(bins=bins, sort=False)
(-0.001, 10.0] 11
(10.0, 20.0] 10
(20.0, 30.0] 8
(30.0, 40.0] 13
(40.0, 50.0] 11
(50.0, 60.0] 9
(60.0, 70.0] 10
(70.0, 80.0] 11
(80.0, 90.0] 13
(90.0, 100.0] 4
Name: a, dtype: int64
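To make the cross-check from the comment above explicit, here's a small sketch (the variable names cut_counts and vc_counts are mine) confirming both approaches yield the same counts:

cut_counts = df.groupby(pd.cut(df.a, bins))["a"].count()
vc_counts = df["a"].value_counts(bins=bins, sort=False)
assert (cut_counts.to_numpy() == vc_counts.to_numpy()).all()  # both give 11, 10, 8, ...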