
2-dimensional binning with Pandas

I have two features that I want to bin (classify) and then combine into a single new feature. It is much like assigning coordinates to grid cells on a map.

The issue is that the features are not evenly distributed and I would like to use quantiles when binning (like with pandas.qcut()) on both features/coordinates.

Is there a better way than doing qcut() on both features and then concatenating the result labels?
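For reference, the approach described above (qcut() on each feature, then concatenating the labels) might look like the sketch below. The column names x and y, the toy data, the seed, and the choice of 4 bins per feature are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, only for repeatability

# Skewed toy data standing in for the two unevenly distributed features
df = pd.DataFrame({"x": rng.exponential(size=100), "y": rng.lognormal(size=100)})

# Quantile-bin each feature on its own; labels=False returns integer bin indices
x_bin = pd.qcut(df["x"], 4, labels=False)
y_bin = pd.qcut(df["y"], 4, labels=False)

# Concatenate the two bin indices into one grid-cell label such as "2_0"
df["cell"] = x_bin.astype(str) + "_" + y_bin.astype(str)
print(df.head())
```

Each feature ends up with equally populated bins, but the combined cells are generally not equally populated, because the two features are binned independently.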

Reuben L. asked Apr 15 '17 06:04


1 Answer

Create a cartesian product categorical.

Consider the dataframe df

import numpy as np
import pandas as pd

# Random data, so the values will differ from run to run
df = pd.DataFrame(dict(A=np.random.rand(20), B=np.random.rand(20)))

           A         B
0   0.538186  0.038985
1   0.185523  0.438329
2   0.652151  0.067359
3   0.746060  0.774688
4   0.373741  0.009526
5   0.603536  0.149733
6   0.775801  0.585309
7   0.091238  0.811828
8   0.504035  0.639003
9   0.671320  0.132974
10  0.619939  0.883372
11  0.301644  0.882258
12  0.956463  0.391942
13  0.702457  0.099619
14  0.367810  0.071612
15  0.454935  0.651631
16  0.882029  0.015642
17  0.880251  0.348386
18  0.496250  0.606346
19  0.805688  0.401578

We can bin each column into quantiles with pd.qcut, which returns a categorical:

d1 = df.assign(
    A_cut=pd.qcut(df.A, 2, labels=[1, 2]),
    B_cut=pd.qcut(df.B, 2, labels=list('ab'))
)

           A         B A_cut B_cut
0   0.538186  0.038985     1     a
1   0.185523  0.438329     1     b
2   0.652151  0.067359     2     a
3   0.746060  0.774688     2     b
4   0.373741  0.009526     1     a
5   0.603536  0.149733     1     a
6   0.775801  0.585309     2     b
7   0.091238  0.811828     1     b
8   0.504035  0.639003     1     b
9   0.671320  0.132974     2     a
10  0.619939  0.883372     2     b
11  0.301644  0.882258     1     b
12  0.956463  0.391942     2     a
13  0.702457  0.099619     2     a
14  0.367810  0.071612     1     a
15  0.454935  0.651631     1     b
16  0.882029  0.015642     2     a
17  0.880251  0.348386     2     a
18  0.496250  0.606346     1     b
19  0.805688  0.401578     2     b
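If you want to see where pd.qcut placed the boundaries, its retbins=True option also returns the bin edges. A minimal sketch on a stand-in series (the seed and the name s are assumptions, not the data printed above):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed
s = pd.Series(rng.random(20), name="A")

# retbins=True returns the bin edges alongside the categorical result
cut, edges = pd.qcut(s, 2, labels=[1, 2], retbins=True)

print(edges)               # three edges: minimum, median, maximum of s
print(cut.value_counts())  # how many rows fell into each bin
```

Those edges can be reused with pd.cut to apply the same bins to new data.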

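You can then build the cartesian product categorical from tuples of the two bin labels:

# Combine the two bin labels of each row into a tuple, e.g. (1, 'a')
d2 = d1.assign(cartesian=pd.Categorical(d1.filter(regex='_cut').apply(tuple, axis=1)))
print(d2)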

           A         B A_cut B_cut cartesian
0   0.538186  0.038985     1     a    (1, a)
1   0.185523  0.438329     1     b    (1, b)
2   0.652151  0.067359     2     a    (2, a)
3   0.746060  0.774688     2     b    (2, b)
4   0.373741  0.009526     1     a    (1, a)
5   0.603536  0.149733     1     a    (1, a)
6   0.775801  0.585309     2     b    (2, b)
7   0.091238  0.811828     1     b    (1, b)
8   0.504035  0.639003     1     b    (1, b)
9   0.671320  0.132974     2     a    (2, a)
10  0.619939  0.883372     2     b    (2, b)
11  0.301644  0.882258     1     b    (1, b)
12  0.956463  0.391942     2     a    (2, a)
13  0.702457  0.099619     2     a    (2, a)
14  0.367810  0.071612     1     a    (1, a)
15  0.454935  0.651631     1     b    (1, b)
16  0.882029  0.015642     2     a    (2, a)
17  0.880251  0.348386     2     a    (2, a)
18  0.496250  0.606346     1     b    (1, b)
19  0.805688  0.401578     2     b    (2, b)
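If tuples are awkward downstream, one possible alternative (not part of the approach above, just a sketch) is to combine the integer codes of the two categoricals into a single cell index. The names a_cut, b_cut, n_b and cell_id are assumptions for illustration, and the data is regenerated with an arbitrary seed.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed
df = pd.DataFrame({"A": rng.random(20), "B": rng.random(20)})
a_cut = pd.qcut(df["A"], 2, labels=[1, 2])
b_cut = pd.qcut(df["B"], 2, labels=list("ab"))

# .cat.codes is each bin's integer position (0, 1, ...) within its categorical
n_b = len(b_cut.cat.categories)
cell_id = a_cut.cat.codes * n_b + b_cut.cat.codes  # row-major index into the grid

# Recover the pair of labels behind a given cell id
a_idx, b_idx = divmod(3, n_b)
print(a_cut.cat.categories[a_idx], b_cut.cat.categories[b_idx])  # 2 b
```

This avoids the row-wise apply, at the cost of a less readable label.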

If you were so inclined, you could even declare an ordering for them.
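One way to do that, as a sketch: list every label pair in the order you want and pass it as categories with ordered=True. The row-major order used here is just one choice, and the variable names and seed are assumptions for illustration.

```python
from itertools import product

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed
df = pd.DataFrame({"A": rng.random(20), "B": rng.random(20)})

a_labels, b_labels = [1, 2], list("ab")
a_cut = pd.qcut(df["A"], 2, labels=a_labels)
b_cut = pd.qcut(df["B"], 2, labels=b_labels)

# Every label pair, in row-major order: (1, 'a'), (1, 'b'), (2, 'a'), (2, 'b')
categories = list(product(a_labels, b_labels))

# One tuple per row, declared as an ordered categorical over those pairs
pairs = list(zip(a_cut, b_cut))
cartesian = pd.Categorical(pairs, categories=categories, ordered=True)

print(cartesian.categories)
```

Listing the categories explicitly also keeps combinations that happen to be empty in the data, and sorting a column of this type follows the declared order.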

piRSquared answered Nov 15 '22 03:11