Can I make the pandas cut/qcut functions return the bin endpoint or bin midpoint instead of a string bin label?
Currently
pd.cut(pd.Series(np.arange(11)), bins = 5)
0 (-0.01, 2]
1 (-0.01, 2]
2 (-0.01, 2]
3 (2, 4]
4 (2, 4]
5 (4, 6]
6 (4, 6]
7 (6, 8]
8 (6, 8]
9 (8, 10]
10 (8, 10]
dtype: category
This returns category / string values. What I want is
0 1.0
1 1.0
2 1.0
3 3.0
4 3.0
with numerical values representing an edge or the midpoint of each bin.
qcut (quantile cut) differs from cut in that the number of elements in each bin is roughly the same, at the cost of intervals of differing widths.
Quantile-based discretization function. Discretize variable into equal-sized buckets based on rank or based on sample quantiles. For example 1000 values for 10 quantiles would produce a Categorical object indicating quantile membership for each data point.
Use cut when you need to segment and sort data values into bins. This function is also useful for going from a continuous variable to a categorical variable. For example, cut could convert ages to groups of age ranges. Supports binning into an equal number of bins, or a pre-specified array of bins.
Use pd.cut() for binning data based on the range of possible values. Use pd.qcut() for binning data based on the actual distribution of values.
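To make the difference concrete, here is a minimal sketch on a made-up skewed sample (the values are hypothetical and chosen only for illustration). It counts how many elements land in each bin for cut and for qcut.

```python
import pandas as pd

# Hypothetical skewed sample: nine small values and one large outlier.
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# cut: two equal-width bins spanning the range of the data,
# so almost everything falls into the first bin.
cut_counts = pd.cut(values, bins=2).value_counts().sort_index().tolist()

# qcut: two bins split at the median, so the counts are balanced
# but the second interval is much wider than the first.
qcut_counts = pd.qcut(values, q=2).value_counts().sort_index().tolist()

print(cut_counts)   # [9, 1]
print(qcut_counts)  # [5, 5]
```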
I noticed that each category is an interval with a mid property, so you can calculate the midpoint via an apply:
In [1]: import pandas as pd
...: import numpy as np
...: df = pd.DataFrame({"val":np.arange(11)})
...: df["bins"] = pd.cut(df["val"], bins = 5)
...: df["bin_centres"] = df["bins"].apply(lambda x: x.mid)
...: df
Out[1]:
val bins bin_centres
0 0 (-0.01, 2.0] 0.995
1 1 (-0.01, 2.0] 0.995
2 2 (-0.01, 2.0] 0.995
3 3 (2.0, 4.0] 3.000
4 4 (2.0, 4.0] 3.000
5 5 (4.0, 6.0] 5.000
6 6 (4.0, 6.0] 5.000
7 7 (6.0, 8.0] 7.000
8 8 (6.0, 8.0] 7.000
9 9 (8.0, 10.0] 9.000
10 10 (8.0, 10.0] 9.000
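As a possible alternative to the row-wise apply, the sketch below reads the midpoints and endpoints once from the categories (an IntervalIndex) and renames the categories with them. The column names bin_left and bin_right are made up for this example, and the final astype(float) is there because rename_categories still returns a categorical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"val": np.arange(11)})
df["bins"] = pd.cut(df["val"], bins=5)

# The categories of the binned column form an IntervalIndex,
# which exposes .left, .right and .mid for all bins at once.
intervals = df["bins"].cat.categories

# Swap each interval label for a number, then convert to plain floats.
df["bin_centres"] = df["bins"].cat.rename_categories(intervals.mid).astype(float)
df["bin_left"] = df["bins"].cat.rename_categories(intervals.left).astype(float)
df["bin_right"] = df["bins"].cat.rename_categories(intervals.right).astype(float)

print(df.dtypes)
```

Because only the five categories are touched rather than every row, this should scale better on long columns, though that is an expectation and not something measured here.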
I see that this is an old post but I will take the liberty to answer it anyway.
It is now possible (ref @chrisb's answer) to access the endpoints of categorical intervals using left and right.
s = pd.cut(pd.Series(np.arange(11)), bins = 5)
mid = [(a.left + a.right)/2 for a in s]
Out[34]: [0.995, 0.995, 0.995, 3.0, 3.0, 5.0, 5.0, 7.0, 7.0, 9.0, 9.0]
Since intervals are open on the left and closed on the right, a first interval starting exactly at 0 would exclude the value 0, so pandas extends it slightly and it actually starts at -0.01. To get a midpoint using 0 as the left value, you can do this:
mid_alt = [(a.left + a.right)/2 if a.left != -0.01 else a.right/2 for a in s]
Out[35]: [1.0, 1.0, 1.0, 3.0, 3.0, 5.0, 5.0, 7.0, 7.0, 9.0, 9.0]
Or, you can specify that the intervals are closed on the left and open on the right:
t = pd.cut(pd.Series(np.arange(11)), bins = 5, right=False)
Out[38]:
0 [0.0, 2.0)
1 [0.0, 2.0)
2 [2.0, 4.0)
3 [2.0, 4.0)
4 [4.0, 6.0)
5 [4.0, 6.0)
6 [6.0, 8.0)
7 [6.0, 8.0)
8 [8.0, 10.01)
9 [8.0, 10.01)
10 [8.0, 10.01)
But, as you see, you get the same problem at the last interval.
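One way to sidestep the widened first or last interval is to supply the bin edges yourself, so pandas has nothing to extend. The following is a sketch under that assumption: the edges are built with np.linspace over the known data range 0 to 10, include_lowest=True keeps the value 0 in the first bin, and the midpoints are passed as labels.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(11))

# Explicit, evenly spaced edges: [0, 2, 4, 6, 8, 10].
edges = np.linspace(0, 10, 6)

# Midpoint of each pair of neighbouring edges: [1, 3, 5, 7, 9].
mids = (edges[:-1] + edges[1:]) / 2

# Label every bin with its midpoint; include_lowest=True makes the
# first bin contain the left edge, so 0 is not dropped as NaN.
binned = pd.cut(s, bins=edges, labels=mids.tolist(), include_lowest=True)

# The result is categorical, so convert it to plain floats.
result = binned.astype(float)
print(result.tolist())
# [1.0, 1.0, 1.0, 3.0, 3.0, 5.0, 5.0, 7.0, 7.0, 9.0, 9.0]
```

This gives the numbers asked for in the question, but it relies on knowing (or computing) the data range up front.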
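At the time of writing, there was a work-in-progress proposal for an 'IntervalIndex' that would make this type of operation very straightforward (it has since been added to pandas, which is what the left, right and mid answers above rely on).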
But for now, you can get the bins by passing the retbins argument and calculate the midpoints.
In [8]: s, bins = pd.cut(pd.Series(np.arange(11)), bins = 5, retbins=True)
In [11]: mid = [(a + b) /2 for a,b in zip(bins[:-1], bins[1:])]
In [13]: s.cat.rename_categories(mid)
Out[13]:
0 0.995
1 0.995
2 0.995
3 3.000
4 3.000
5 5.000
6 5.000
7 7.000
8 7.000
9 9.000
10 9.000
dtype: category
Categories (5, float64): [0.995 < 3.000 < 5.000 < 7.000 < 9.000]
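The question also mentions qcut, and the same retbins idea appears to carry over. This is a minimal sketch, assuming four quantile bins on the same sample data; the variable names are chosen for this example. The trailing astype(float) turns the renamed categorical into an ordinary numeric Series.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(11))

# retbins=True also returns the quantile edges used for binning.
binned, edges = pd.qcut(s, q=4, retbins=True)

# Midpoint of each pair of neighbouring edges.
mids = (edges[:-1] + edges[1:]) / 2

# Replace the interval labels with the midpoints and drop the
# categorical dtype so the result is plain float64.
mid_values = binned.cat.rename_categories(mids).astype(float)
print(mid_values)
```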