Starting with this sample data...
import pandas as pd

start_data = {"person_id": [1, 1, 1, 1, 2], "nid": [1, 2, 3, 4, 1],
              "beg": ["Jan 1 2018", "Jan 5 2018", "Jan 10 2018", "Feb 5 2018", "Jan 25 2018"],
              "end": ["Feb 1 2018", "Mar 4 2018", "", "Oct 18 2018", "Nov 10 2018"]}
df = pd.DataFrame(start_data)
df["beg"] = pd.to_datetime(df["beg"])
df["end"] = pd.to_datetime(df["end"])  # the empty string parses to NaT (an open-ended nid)
Starting point:
person_id nid beg end
0 1 1 2018-01-01 2018-02-01
1 1 2 2018-01-05 2018-03-04
2 1 3 2018-01-10 NaT
3 1 4 2018-02-05 2018-10-18
4 2 1 2018-01-25 2018-11-10
Goal output:
person_id date 1 2 3 4
1 2018-01-01 1 0 0 0
1 2018-01-05 1 1 0 0
1 2018-01-10 1 1 1 0
1 2018-02-01 0 1 1 0
1 2018-02-05 0 1 1 1
1 2018-03-04 0 0 1 1
1 2018-10-18 0 0 1 0
2 2018-01-25 1 0 0 0
2 2018-11-10 0 0 0 0
I am trying to tie all active nids to the associated person_id. This will then be joined to another dataframe based on the latest date less than a dated activity column, and finally it will be part of the input to a predictive model.
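That join would look roughly like a backward as-of merge. In the sketch below, activity and its activity_date column are placeholder names for the other dataframe, and wide stands for the goal output built below:
joined = pd.merge_asof(
    activity.sort_values('activity_date'),   # placeholder frame with the dated activity column
    wide.sort_values('date'),                # the goal output: person_id, date, one column per nid
    left_on='activity_date', right_on='date',
    by='person_id',
    direction='backward',                    # pick the latest date...
    allow_exact_matches=False,               # ...strictly less than the activity date
)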
Doing something like pd.get_dummies(df["nid"]) gets this output:
1 2 3 4
0 1 0 0 0
1 0 1 0 0
2 0 0 1 0
3 0 0 0 1
4 1 0 0 0
So this needs to be moved to an index representing the effective date, grouped by person_id, and then aggregated to match the goal output.
Special bonus to anyone who can come up with an approach that would properly leverage Dask, which is what we are using for other parts of the pipeline because of its scalability. This may be a pipe dream, but I thought I would put it out there to see what comes back.
The question is hard; I can only think of numpy broadcasting to speed up the for loop:
# stack the beg/end dates into one long series indexed by (person_id, beg/end)
s = df.set_index('person_id')[['beg', 'end']].stack()
l = []
for x, y in df.groupby('person_id'):
    # treat an open-ended nid as running until that person's latest end date
    y = y.fillna({'end': y.end.max()})
    s1 = y.beg.values
    s2 = y.end.values
    t = s.loc[x].values
    # broadcast: for each event date t, a nid is active if beg <= t < end
    l.append(pd.DataFrame(((s1 - t[:, None]).astype(float) <= 0) & ((s2 - t[:, None]).astype(float) > 0),
                          columns=y.nid, index=s.loc[[x]].index))
# attach the one-hot columns to the event dates and sort by person and date
s = pd.concat([s, pd.concat(l).fillna(0).astype(int)], axis=1).reset_index(level=0).sort_values(['person_id', 0])
s
Out[401]:
person_id 0 1 2 3 4
beg 1 2018-01-01 1 0 0 0
beg 1 2018-01-05 1 1 0 0
beg 1 2018-01-10 1 1 1 0
end 1 2018-02-01 0 1 1 0
beg 1 2018-02-05 0 1 1 1
end 1 2018-03-04 0 0 1 1
end 1 2018-10-18 0 0 0 0
beg 2 2018-01-25 1 0 0 0
end 2 2018-11-10 0 0 0 0
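To push this toward the goal layout, an extra cleanup step (not part of the original answer) can rename the date column, drop the beg/end labels, and reorder the columns; the nid columns 1-4 here come from the sample data:
out = (s.rename(columns={0: 'date'})        # column 0 holds the event dates
        .reset_index(drop=True)             # drop the beg/end index labels
        .loc[:, ['person_id', 'date', 1, 2, 3, 4]])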
Similar to @WenYoBen's approach, just a little different in the broadcasting and in what is returned:
def onehot(group):
    pid, g = group
    # open-ended rows run until the person's latest known end date
    ends = g.end.fillna(g.end.max())
    begs = g.beg
    # every begin/end date for this person, in chronological order
    days = pd.concat((ends, begs)).sort_values().unique()
    # broadcast: a nid is active on a given day if beg <= day < end
    ret = pd.DataFrame((days[:, None] < ends.values) & (days[:, None] >= begs.values),
                       columns=g.nid)
    ret['person_id'] = pid
    return ret

new_df = pd.concat([onehot(group) for group in df.groupby('person_id')], sort=False)
new_df.fillna(0).astype(int)
Output:
   1  2  3  4  person_id
0 1 0 0 0 1
1 1 1 0 0 1
2 1 1 1 0 1
3 0 1 1 0 1
4 0 1 1 1 1
5 0 0 1 1 1
6 0 0 0 0 1
0 1 0 0 0 2
1 0 0 0 0 2
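For the Dask bonus, a minimal and untested sketch, assuming dask.dataframe is available. The helper onehot_group and the all_nids list are names introduced for this sketch; the idea is the same broadcasting as above, wrapped so that every group returns an identical schema and can go through Dask's groupby-apply with an explicit meta:
import dask.dataframe as dd

all_nids = sorted(df['nid'].unique())   # fixed column set so every group returns the same shape

def onehot_group(g, nids=tuple(all_nids)):
    # same broadcasting idea as onehot() above, but g is just the group frame
    ends = g['end'].fillna(g['end'].max())
    begs = g['beg']
    days = pd.concat((ends, begs)).sort_values().unique()
    active = pd.DataFrame((days[:, None] < ends.values) & (days[:, None] >= begs.values),
                          columns=g['nid'])
    # align to the full nid set so the output schema matches meta for every group
    active = active.reindex(columns=list(nids), fill_value=False).astype(int)
    active.insert(0, 'date', days)
    active.insert(0, 'person_id', g['person_id'].iloc[0])
    return active

ddf = dd.from_pandas(df, npartitions=2)
meta = onehot_group(df.iloc[:1]).iloc[:0]   # empty frame describing the output schema
result = ddf.groupby('person_id').apply(onehot_group, meta=meta).compute()
Whether this pays off depends on the data: the per-person groups are small, so the shuffle triggered by groupby-apply may dominate the runtime.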