Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Grouping numerical values in pandas

Tags:

python

pandas

In my Dataframe I have one column with numeric values, let say - distance. I want to find out which group of distance (range) have the biggest number of records (rows).

Doing simple: df.distance.count_values() returns:

74         1
90         1
94         1
893        1
889        1
885        1
877        1
833        1
122        1
545        1

What I want to achieve is something like buckets from histogram, so I am expecting output like this:

900         4 #all values < 900 and > 850
100         3
150         1
550         1
850         1

The one approach I've figured out so far, but I don't think is the best and most optimal one is just find max and min values, divide by my step (50 in this case) and then do loop checking all the values and assigning to appropriate group.

Is there any other, better approach for that?

like image 249
sebap123 Avatar asked Sep 03 '25 05:09

sebap123


2 Answers

I'd suggest doing the following, assuming your value column is labeled val

import numpy as np
df['bin'] = df['val'].apply(lambda x: 50*np.floor(x/50))

The result is the following:

df.groupby('bin')['val'].count()
like image 191
DrTRD Avatar answered Sep 04 '25 17:09

DrTRD


Thanks to EdChum suggestion and based on this example I've figured out, the best way (at least for me) is to do something like this:

import numpy as np

step = 50

#...

max_val = df.distance.max()
bins = list(range(0,int(np.ceil(max_val/step))*step+step,step))
clusters = pd.cut(df.distance,bins,labels=bins[1:])
like image 24
sebap123 Avatar answered Sep 04 '25 17:09

sebap123