Starting with this sample data...
import pandas as pd

start_data = {"person_id": [1, 1, 1, 1, 2], "nid": [1, 2, 3, 4, 1],
              "beg": ["Jan 1 2018", "Jan 5 2018", "Jan 10 2018", "Feb 5 2018", "Jan 25 2018"],
              "end": ["Feb 1 2018", "Mar 4 2018", "", "Oct 18 2018", "Nov 10 2018"]}
df = pd.DataFrame(start_data)
df["beg"] = pd.to_datetime(df["beg"])
df["end"] = pd.to_datetime(df["end"])  # the empty string parses to NaT (an open-ended nid)
Starting point:
person_id nid beg end
0 1 1 2018-01-01 2018-02-01
1 1 2 2018-01-05 2018-03-04
2 1 3 2018-01-10 NaT
3 1 4 2018-02-05 2018-10-18
4 2 1 2018-01-25 2018-11-10
Goal output:
person_id date 1 2 3 4
1 2018-01-01 1 0 0 0
1 2018-01-05 1 1 0 0
1 2018-01-10 1 1 1 0
1 2018-02-01 0 1 1 0
1 2018-02-05 0 1 1 1
1 2018-03-04 0 0 1 1
1 2018-10-18 0 0 1 0
2 2018-01-25 1 0 0 0
2 2018-11-10 0 0 0 0
I am trying to tie all active nids to the associated person_id. This will then be joined to another dataframe based on the latest date less than a dated activity column, and finally it will be part of the input to a predictive model.
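That join would look roughly like a backward as-of merge. In the sketch below, activity and its activity_date column are placeholder names for the other dataframe, and wide stands for the goal output built below:
joined = pd.merge_asof(
    activity.sort_values('activity_date'),   # placeholder frame with the dated activity column
    wide.sort_values('date'),                # the goal output: person_id, date, one column per nid
    left_on='activity_date', right_on='date',
    by='person_id',
    direction='backward',                    # pick the latest date...
    allow_exact_matches=False,               # ...strictly less than the activity date
)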
Doing something like pd.get_dummies(df["nid"]) gets this output:
1 2 3 4
0 1 0 0 0
1 0 1 0 0
2 0 0 1 0
3 0 0 0 1
4 1 0 0 0
So this needs to be moved to an index representing the effective date, grouped by person_id, and then aggregated to match the goal output.
Special bonus to anyone who can come up with an approach that would properly leverage Dask, which is what we are using for other parts of the pipeline because of its scalability. This may be a pipe dream, but I thought I would put it out there to see what comes back.
The question is hard; I can only think of numpy broadcasting to speed up the for loop:
# stack the beg/end dates into one long series indexed by (person_id, beg/end)
s = df.set_index('person_id')[['beg', 'end']].stack()
l = []
for x, y in df.groupby('person_id'):
    # treat an open-ended nid as running until that person's latest end date
    y = y.fillna({'end': y.end.max()})
    s1 = y.beg.values
    s2 = y.end.values
    t = s.loc[x].values
    # broadcast: for each event date t, a nid is active if beg <= t < end
    l.append(pd.DataFrame(((s1 - t[:, None]).astype(float) <= 0) & ((s2 - t[:, None]).astype(float) > 0),
                          columns=y.nid, index=s.loc[[x]].index))
# attach the one-hot columns to the event dates and sort by person and date
s = pd.concat([s, pd.concat(l).fillna(0).astype(int)], axis=1).reset_index(level=0).sort_values(['person_id', 0])
s
Out[401]:
person_id 0 1 2 3 4
beg 1 2018-01-01 1 0 0 0
beg 1 2018-01-05 1 1 0 0
beg 1 2018-01-10 1 1 1 0
end 1 2018-02-01 0 1 1 0
beg 1 2018-02-05 0 1 1 1
end 1 2018-03-04 0 0 1 1
end 1 2018-10-18 0 0 0 0
beg 2 2018-01-25 1 0 0 0
end 2 2018-11-10 0 0 0 0
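To push this toward the goal layout, an extra cleanup step (not part of the original answer) can rename the date column, drop the beg/end labels, and reorder the columns; the nid columns 1-4 here come from the sample data:
out = (s.rename(columns={0: 'date'})        # column 0 holds the event dates
        .reset_index(drop=True)             # drop the beg/end index labels
        .loc[:, ['person_id', 'date', 1, 2, 3, 4]])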
Similar to @WenYoBen's approach, just a little different in the broadcasting and in what is returned:
def onehot(group):
    pid, g = group
    # open-ended rows run until the person's latest known end date
    ends = g.end.fillna(g.end.max())
    begs = g.beg
    # every begin/end date for this person, in chronological order
    days = pd.concat((ends, begs)).sort_values().unique()
    # broadcast: a nid is active on a given day if beg <= day < end
    ret = pd.DataFrame((days[:, None] < ends.values) & (days[:, None] >= begs.values),
                       columns=g.nid)
    ret['person_id'] = pid
    return ret

new_df = pd.concat([onehot(group) for group in df.groupby('person_id')], sort=False)
new_df.fillna(0).astype(int)
Output:
   1  2  3  4  person_id
0 1 0 0 0 1
1 1 1 0 0 1
2 1 1 1 0 1
3 0 1 1 0 1
4 0 1 1 1 1
5 0 0 1 1 1
6 0 0 0 0 1
0 1 0 0 0 2
1 0 0 0 0 2
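For the Dask bonus, a minimal and untested sketch, assuming dask.dataframe is available. The helper onehot_group and the all_nids list are names introduced for this sketch; the idea is the same broadcasting as above, wrapped so that every group returns an identical schema and can go through Dask's groupby-apply with an explicit meta:
import dask.dataframe as dd

all_nids = sorted(df['nid'].unique())   # fixed column set so every group returns the same shape

def onehot_group(g, nids=tuple(all_nids)):
    # same broadcasting idea as onehot() above, but g is just the group frame
    ends = g['end'].fillna(g['end'].max())
    begs = g['beg']
    days = pd.concat((ends, begs)).sort_values().unique()
    active = pd.DataFrame((days[:, None] < ends.values) & (days[:, None] >= begs.values),
                          columns=g['nid'])
    # align to the full nid set so the output schema matches meta for every group
    active = active.reindex(columns=list(nids), fill_value=False).astype(int)
    active.insert(0, 'date', days)
    active.insert(0, 'person_id', g['person_id'].iloc[0])
    return active

ddf = dd.from_pandas(df, npartitions=2)
meta = onehot_group(df.iloc[:1]).iloc[:0]   # empty frame describing the output schema
result = ddf.groupby('person_id').apply(onehot_group, meta=meta).compute()
Whether this pays off depends on the data: the per-person groups are small, so the shuffle triggered by groupby-apply may dominate the runtime.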