I have a DataFrame that looks like this:
    var1  var2  var3  var4  var5  var6
0    0.3   0.6   0.7   0.8   0.7   0.5
1    0.7   0.6   0.4   0.6   0.7   1.0
2    0.0   0.0   0.0   0.0   0.0   0.0
3    0.1   0.9   0.5   0.7   0.7   0.9
4    0.3   2.3   0.4   2.0   1.9   1.9
5    4.0   1.2   0.6   1.2   2.6   3.1
6    0.0   0.0   0.0   0.0   0.0   0.0
7    0.0   0.2   0.1   0.2   0.2   0.2
8    0.1   0.1   0.1   0.1   0.1   0.1
9    0.0   0.0   0.0   0.0   0.0   0.0
10   0.1   0.1   0.1   0.2   0.1   0.1
11   0.0   0.0   0.0   0.0   0.0   0.1
12   0.0   0.0   0.0   0.0   0.0   0.0
13   0.0   0.0   0.0   0.0   0.0   0.0
I want to create 4 bins (strictly 4 bins) for every column, so I apply the pandas cut function to each column separately:
import pandas as pd
# so is the DataFrame shown above
qt = so.apply(lambda x: pd.cut(x, 4))
Then if I do
qt.var1.unique()
I get
[(-0.004, 1.0], (3.0, 4.0]]
Categories (2, interval[float64]): [(-0.004, 1.0] < (3.0, 4.0]]
which has only 2 categories.
Any ideas why this happens?
For var1, pd.cut splits the range of the data into equal-width bins. The values of var1 run from 0 to 4, so you get these four intervals:
Categories (4, interval[float64]): [(-0.004, 1.0] < (1.0, 2.0] < (2.0, 3.0] < (3.0, 4.0]]
unique() shows only 2 of them, because the values of var1 fall into only 2 of the 4 intervals; the other two bins exist but are empty.
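To make the difference between defined and occupied bins concrete, here is a minimal sketch that rebuilds var1 from the table in the question. The variable names (var1, binned, n_defined, n_used, counts) are chosen for illustration only.

```python
import pandas as pd

# var1 column copied from the DataFrame in the question
var1 = pd.Series([0.3, 0.7, 0.0, 0.1, 0.3, 4.0, 0.0,
                  0.0, 0.1, 0.0, 0.1, 0.0, 0.0, 0.0])

binned = pd.cut(var1, 4)

# Bins that pd.cut defined, whether or not any value landed in them
n_defined = len(binned.cat.categories)

# Bins that actually contain at least one value
n_used = len(binned.unique())

# Per-bin counts; for categorical data this also lists the empty bins
counts = binned.value_counts(sort=False)

print(n_defined, n_used)
print(counts)
```

The categories attribute reports all four bins, while value_counts shows that the two middle ones hold no rows.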
Explanation for -0.004:
The range of x is extended by .1% on each side to include the minimum and maximum values of x.
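As a rough check of where -0.004 comes from, the sketch below computes the edges by hand and compares them with what pd.cut returns via retbins=True. The manual calculation is my reading of the behaviour (with the default right=True only the lowest edge appears to be shifted), not a quote of the pandas implementation, and the names lo, hi, adj and manual_edges are illustrative.

```python
import numpy as np
import pandas as pd

# var1 column copied from the DataFrame in the question
var1 = np.array([0.3, 0.7, 0.0, 0.1, 0.3, 4.0, 0.0,
                 0.0, 0.1, 0.0, 0.1, 0.0, 0.0, 0.0])

lo, hi = var1.min(), var1.max()

# 0.1% of the data range: (4.0 - 0.0) * 0.001 = 0.004
adj = (hi - lo) * 0.001

# Four equal-width bins need five edges
manual_edges = np.linspace(lo, hi, 5)

# Assumption: with right=True, only the lowest edge is moved down
# so that the minimum value falls inside the first interval
manual_edges[0] -= adj

# Edges as computed by pandas itself
_, pandas_edges = pd.cut(var1, 4, retbins=True)

print(manual_edges)
print(pandas_edges)
```

Because the intervals are closed on the right, the minimum value 0.0 would fall outside (0.0, 1.0], which is why the first edge is nudged below it.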
The documentation specifies that the bins have the same width:
Defines the number of equal-width bins in the range of x...
In your case, 4 equal-width bins are created, but your data does not fall into all of them. Here is an example:
>>> import numpy as np
>>> import pandas as pd
>>> a = np.arange(12)
>>> print(len(pd.cut(a, 4).unique()))
4
>>> b = np.array([1, 2, 3, 10, 20])
>>> print(len(pd.cut(b, 4).unique()))
3
As you can see, in the latter case 4 bins are still created, but only 3 of them contain values.