Create open bounds indicators from pandas get_dummies on discretized numerical

Tags: python, pandas

Starting from a numeric pandas column age, discretized as ageD with qcut, we build open bounds from the qcut bin edges:

import pandas as pd
from itertools import chain

d = {'age':  {0: 5, 1: 23, 2: 43, 3: 70, 4: 30}}
df = pd.DataFrame.from_dict(d)
df['ageD'] = pd.qcut(df.iloc[:, 0], 2)
df.ageD.cat.categories
# Index([u'[5, 30]', u'(30, 70]'], dtype='object')

From Index([u'[5, 30]', u'(30, 70]'], dtype='object') we make bopens:

>>> bopens = get_open_bounds(df)
>>> bopens
# ['(-inf, 5]', '(-inf, 30]', '(-inf, 70]', '(5, +inf)', '(30, +inf)', '(70, +inf)']

Then we convert the categorical variable into dummy/indicator variables with get_dummies:

df = pd.get_dummies(df)
print(df)
#    age  ageD_[5, 30]  ageD_(30, 70]
# 0    5             1              0
# 1   23             1              0
# 2   43             0              1
# 3   70             0              1
# 4   30             1              0

I want to enrich the data frame with the open-bound columns. df.shape will be quite big, ~(10e6, 32). What is the best way to build the 6 bopen columns for each row?

The target df will look like this one:

>>> df
   age  age_[5, 30]  age_(30, 70]  (-inf, 5]  (-inf, 30]  (-inf, 70]  (5, +inf)  (30, +inf)  (70, +inf)
0    5            1             0          1           1           1          0           0           0
1   23            1             0          0           1           1          1           0           0
2   43            0             1          0           0           1          1           1           0
3   70            0             1          0           0           1          1           1           0
4   30            1             0          0           1           1          1           0           0

PS: the get_open_bounds function used to make bopens:

def get_open_bounds(df):
    bounds = [(int(x[1:]), int(y[:-1])) for x, y in
             [c.split(', ') for c in df.ageD.cat.categories]]
    bounds = list(chain(*bounds))
    # bounds is now [5, 30, 30, 70]

    # to get uniques, keeping the order
    bounds = [b for idx, b in enumerate(bounds) if b not in bounds[:idx]]

    # make the open bounds
    bopens = ["(-inf, {}]".format(b) for b in bounds] + \
             ["({}, +inf)".format(b) for b in bounds]
    return bopens

user3313834 asked Oct 31 '22 12:10

1 Answer

IIUC, you can do this with a bit of broadcasting:

df['ageD'], bins = pd.qcut(df.iloc[:, 0], 2, retbins=True)
left = (df["age"].values <= bins[:,None]).T.astype(int)
dl = pd.DataFrame(left, columns=["(-inf, {}]".format(b) for b in bins])
dr = pd.DataFrame(1-left, columns=["({}, +inf)".format(b) for b in bins])
dout = pd.concat([pd.get_dummies(df), dl, dr], axis=1)

gives me

>>> dout
   age  ageD_[5, 30]  ageD_(30, 70]  (-inf, 5]  (-inf, 30]  (-inf, 70]  (5, +inf)  (30, +inf)  (70, +inf)
0    5             1              0          1           1           1          0           0           0
1   23             1              0          0           1           1          1           0           0
2   43             0              1          0           0           1          1           1           0
3   70             0              1          0           0           1          1           1           0
4   30             1              0          0           1           1          1           0           0

Note #1: by adding retbins=True I get the bin edges themselves and avoid some awkward string parsing.
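To illustrate Note #1, here is a minimal sketch of building the bopens labels straight from the edges that qcut returns, with no string parsing. The cast to int is an assumption that only makes sense when the ages (and therefore the quantiles) are whole numbers, as in the example data; the edges variable name is introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({"age": [5, 23, 43, 70, 30]})

# retbins=True makes qcut return a pair: (binned values, array of bin edges).
df["ageD"], bins = pd.qcut(df["age"], 2, retbins=True)

# The edges come back as floats. Casting to int is an assumption that is
# only safe here because every quantile of the sample is a whole number.
edges = bins.astype(int)

# Same label format as get_open_bounds in the question.
bopens = ["(-inf, {}]".format(b) for b in edges] + \
         ["({}, +inf)".format(b) for b in edges]
print(bopens)
```

With the five sample ages this should reproduce the six labels listed in the question.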

Note #2: by doing an implicit "right = 1 - left", I'm assuming that no ages are NaN and so exactly one of <= or > must be true; if that's not certain, you could do right = (df["age"].values > bins[:,None]).T.astype(int) instead.
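A small sketch of why Note #2 matters, assuming a toy series with one missing age and hand-written bin edges (both are illustrative, not taken from the question's data). Any comparison with NaN is False, so the implicit complement flags the NaN row on the "greater than" side while the explicit comparison does not.

```python
import numpy as np
import pandas as pd

age = pd.Series([5, np.nan, 43], name="age")
bins = np.array([5.0, 30.0, 70.0])  # hypothetical bin edges

vals = age.to_numpy()

# Shape (n_rows, n_bins). Comparisons with NaN are always False,
# so the NaN row is all zeros.
left = (vals <= bins[:, None]).T.astype(int)

# Implicit complement: turns the NaN row into all ones.
right_implicit = 1 - left

# Explicit comparison: the NaN row stays all zeros.
right_explicit = (vals > bins[:, None]).T.astype(int)

print(right_implicit[1], right_explicit[1])
```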

Note #3: really I should be passing the frame constructors df.index as well -- while your example had the canonical index, that may not be true in your real data.
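Putting Notes #2 and #3 together, one possible sketch is below. The helper name add_open_bounds and its parameters are hypothetical, and the int8 dtype is just one assumed way to keep memory down on a frame with millions of rows. Passing index=out.index matters because concat aligns on the index: without it, a frame with a non-default index would end up with misaligned rows full of NaN.

```python
import numpy as np
import pandas as pd

def add_open_bounds(df, col="age", q=2):
    """Hypothetical helper: get_dummies(df) plus open-bound indicator columns."""
    out = df.copy()
    out[col + "D"], bins = pd.qcut(out[col], q, retbins=True)
    vals = out[col].to_numpy()

    # Broadcast (n_bins, 1) against (n_rows,), then transpose to (n_rows, n_bins).
    left = (vals <= bins[:, None]).T
    right = (vals > bins[:, None]).T  # explicit, so NaN rows stay 0 on both sides

    # Reuse the original index so that concat aligns row by row.
    dl = pd.DataFrame(left.astype(np.int8), index=out.index,
                      columns=["(-inf, {:g}]".format(b) for b in bins])
    dr = pd.DataFrame(right.astype(np.int8), index=out.index,
                      columns=["({:g}, +inf)".format(b) for b in bins])
    return pd.concat([pd.get_dummies(out), dl, dr], axis=1)

# A non-default index, to show that the rows still line up.
df = pd.DataFrame({"age": [5, 23, 43, 70, 30]}, index=list("abcde"))
dout = add_open_bounds(df)
print(dout)
```

The exact names of the get_dummies columns depend on how the installed pandas version labels the qcut intervals, so they may differ from the output shown above.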

DSM answered Nov 08 '22 07:11