Say I have a huge list of numbers between 0 and 100. I compute the bin edges from the max number, splitting the range into 10 bins, so my ranges are, for example
ranges = [0,10,20,30,40,50,60,70,80,90,100]
Now I count the occurrences in each range: 0-10, 10-20, and so on. I iterate over every number in the list and check which range it falls into. I assume this is not the best way in terms of runtime speed.
Can I speed it up by using pandas, e.g. pandas.groupby, and how?
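For reference, here is a minimal sketch of the loop I currently use (the random data is just a stand-in for my real list, and I assume right-closed bins, so a value of exactly 0 would fall outside every bin, the same edge case pd.cut has):

import random

numbers = [random.randint(0, 100) for _ in range(100_000)]  # stand-in for my data
ranges = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

counts = [0] * (len(ranges) - 1)
for x in numbers:
    for i in range(len(ranges) - 1):
        if ranges[i] < x <= ranges[i + 1]:  # right-closed bin (lo, hi]
            counts[i] += 1
            break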
We can use pd.cut to bin the values into ranges, then we can groupby these ranges, and finally call count to count the values now binned into these ranges:
import numpy as np
import pandas as pd

np.random.seed(0)
# randint's upper bound is exclusive: 101 gives integers 1..100
# (replaces the deprecated np.random.random_integers)
df = pd.DataFrame({"a": np.random.randint(1, 101, size=100)})
ranges = [0,10,20,30,40,50,60,70,80,90,100]
df.groupby(pd.cut(df.a, ranges)).count()
a
a
(0, 10] 11
(10, 20] 10
(20, 30] 8
(30, 40] 13
(40, 50] 11
(50, 60] 9
(60, 70] 10
(70, 80] 11
(80, 90] 13
(90, 100] 4
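As an aside, the doubled a header in the output is just the counted column grouped by itself; selecting the column first gives a plain Series of counts:

df.groupby(pd.cut(df.a, ranges))["a"].count()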
Surprised I haven't seen this yet, so without further ado, here is .value_counts(bins=N). Computing bins with pd.cut followed by a groupby is a 2-step process; value_counts allows you a shortcut using the bins argument:
# Uses Ed Chum's setup. Cross check our answers match!
np.random.seed(0)
df = pd.DataFrame({"a": np.random.randint(1, 101, size=100)})  # replaces deprecated random_integers
df['a'].value_counts(bins=10, sort=False)
(0.9, 10.9] 11
(10.9, 20.8] 10
(20.8, 30.7] 8
(30.7, 40.6] 13
(40.6, 50.5] 11
(50.5, 60.4] 9
(60.4, 70.3] 10
(70.3, 80.2] 11
(80.2, 90.1] 13
(90.1, 100.0] 4
Name: a, dtype: int64
This creates 10 evenly-spaced, right-closed intervals and bin-counts your data. sort=False is required to stop value_counts from ordering the result in decreasing order of count.
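For comparison, with the default sort=True the same call returns the bins ordered by count instead (a quick sketch):

df['a'].value_counts(bins=10)  # most populous bins, e.g. the 13-count intervals, come first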
If you need specific bin edges rather than evenly-spaced ones, you can pass a list to the bins argument:
bins = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
df['a'].value_counts(bins=bins, sort=False)
(-0.001, 10.0] 11
(10.0, 20.0] 10
(20.0, 30.0] 8
(30.0, 40.0] 13
(40.0, 50.0] 11
(50.0, 60.0] 9
(60.0, 70.0] 10
(70.0, 80.0] 11
(80.0, 90.0] 13
(90.0, 100.0] 4
Name: a, dtype: int64
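To make the cross-check from the comment above explicit, here's a small sketch (the variable names cut_counts and vc_counts are mine) confirming both approaches yield the same counts:

cut_counts = df.groupby(pd.cut(df.a, bins))["a"].count()
vc_counts = df["a"].value_counts(bins=bins, sort=False)
assert (cut_counts.to_numpy() == vc_counts.to_numpy()).all()  # both give 11, 10, 8, ...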