I am trying to classify my data into percentile buckets based on their values. My data looks like this:
import numpy as np
import pandas as pd

a = pd.DataFrame(index=['a','b','c','d','e','f','g','h','i','j'], columns=['data'])
a.data = np.random.randn(10)  # ten random values from a standard normal
print(a)
print('\nthese are ranked as shown')
print(a.rank())
data
a -0.310188
b -0.191582
c 0.860467
d -0.458017
e 0.858653
f -1.640166
g -1.969908
h 0.649781
i 0.218000
j 1.887577
these are ranked as shown
data
a 4
b 5
c 9
d 3
e 8
f 2
g 1
h 7
i 6
j 10
To rank this data, I am using the rank function. However, I am interested in creating a bucket of the top 20%. In the example shown above, this would be a list containing the labels ['c', 'j'].
Desired result: ['c', 'j']
How do I get the desired result?
Use pd.cut() to bin data based on the range of possible values. Use pd.qcut() to bin data based on the actual distribution of values.
qcut() discretizes a variable into equal-sized buckets based on rank or on sample quantiles. For example, 1000 values split into 10 quantiles would produce a Categorical object indicating quantile membership for each data point.
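As a rough sketch of how qcut() could be applied to the question's data: the values below are copied from the sample output in the question, and the choice of five buckets with integer codes (labels=False) is an assumption made for illustration, not something the original answer spells out.

```python
import pandas as pd

# Values copied from the sample output in the question.
a = pd.DataFrame(
    {"data": [-0.310188, -0.191582, 0.860467, -0.458017, 0.858653,
              -1.640166, -1.969908, 0.649781, 0.218000, 1.887577]},
    index=list("abcdefghij"),
)

# Split into five equal-sized buckets. labels=False returns integer
# codes 0..4 instead of interval labels, so code 4 is the top 20%.
buckets = pd.qcut(a["data"], q=5, labels=False)

# Index labels of the rows that fall in the highest bucket.
top_20 = buckets[buckets == 4].index.tolist()
print(top_20)
```

With this sample, the highest bucket holds 'c' and 'j'. Note that qcut() raises an error if the data has so many ties that bucket edges coincide, unless duplicates="drop" is passed.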
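If you only need the top 20%, you can also filter directly against the 0.8 quantile (df here is the DataFrame called a in the question):
In [13]: df[df > df.quantile(0.8)].dropna()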
Out[13]:
data
c 0.860467
j 1.887577
In [14]: list(df[df > df.quantile(0.8)].dropna().index)
Out[14]: ['c', 'j']
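For reference, here is a self-contained variant of the same idea that works on the column as a Series, plus a rank-based version closer to what the question started with. This is only a sketch: the values are copied from the question's sample output, and the variable names (s, by_quantile, by_rank) are made up for the example.

```python
import pandas as pd

# Values copied from the sample output in the question.
a = pd.DataFrame(
    {"data": [-0.310188, -0.191582, 0.860467, -0.458017, 0.858653,
              -1.640166, -1.969908, 0.649781, 0.218000, 1.887577]},
    index=list("abcdefghij"),
)
s = a["data"]

# Boolean mask on the Series: non-matching rows are dropped outright,
# so no NaN rows appear and dropna() is not needed.
by_quantile = s[s > s.quantile(0.8)].index.tolist()

# Rank-based version: ranks run from 1 to len(s), so the top 20%
# are the rows whose rank exceeds 80% of the row count.
ranks = s.rank()
by_rank = s[ranks > 0.8 * len(s)].index.tolist()

print(by_quantile, by_rank)
```

Both approaches select 'c' and 'j' for this sample. They can differ when the data contains ties or when the row count is not a multiple of five, so it is worth checking which definition of "top 20%" you actually want.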