I am using a recurrent neural network to consume time-series events (click stream). My data needs to be formatted such that each row contains all the events for an id. My data is one-hot encoded, and I have already grouped it by id. I also limit the total number of events per id (e.g. 2), so the final width is always known (number of one-hot columns x number of events). I need to maintain the order of the events, because they are ordered by time.
Current data state:
    id  page.A  page.B  page.C
0  001       0       1       0
1  001       1       0       0
2  002       0       0       1
3  002       1       0       0
Required data state:
    id  page.A1  page.B1  page.C1  page.A2  page.B2  page.C2
0  001        0        1        0        1        0        0
1  002        0        0        1        1        0        0
This looks like a pivot problem to me, but my resulting dataframes are not in the format I need. Any suggestions on how I should approach this?
The idea here is to reset_index within each group of 'id' to get a count of which row of that particular 'id' we are at. Then follow that up with unstack and sort_index to get the columns where they are supposed to be. Finally, flatten the MultiIndex.
df1 = df.set_index('id').groupby(level=0) \
    .apply(lambda g: g.reset_index(drop=True)) \
    .unstack().sort_index(axis=1, level=1)  # Thx @jezrael for sort reminder
df1.columns = ['{}{}'.format(x[0], int(x[1]) + 1) for x in df1.columns]
df1
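For anyone who wants to try this end to end, here is a minimal self-contained sketch. The sample frame is an assumption rebuilt from the "current data state" in the question (with id stored as a string so the leading zeros survive); the reshaping steps are the same as above.

```python
import pandas as pd

# Sample data rebuilt from the question's "current data state" (assumed dtypes)
df = pd.DataFrame({
    'id': ['001', '001', '002', '002'],
    'page.A': [0, 1, 0, 1],
    'page.B': [1, 0, 0, 0],
    'page.C': [0, 0, 1, 0],
})

df1 = (
    df.set_index('id')
      .groupby(level=0)
      # Renumber rows 0, 1, ... inside each id, keeping the original order
      .apply(lambda g: g.reset_index(drop=True))
      # Move the per-id row number into the columns
      .unstack()
      # Group columns by event position first, then by page name
      .sort_index(axis=1, level=1)
)

# Flatten ('page.A', 0) into 'page.A1', and so on
df1.columns = ['{}{}'.format(name, int(pos) + 1) for name, pos in df1.columns]
print(df1)
```

Note that id stays in the index here; call df1.reset_index() if you need it back as a regular column.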
You can first create a new column with cumcount to use in the new column names, then call set_index and unstack. Then sort the columns on level 1 with sort_index, flatten the MultiIndex in the columns with a list comprehension, and finally call reset_index:
df['g'] = (df.groupby('id').cumcount() + 1).astype(str)
df1 = df.set_index(['id','g']).unstack()
df1.sort_index(axis=1,level=1, inplace=True)
df1.columns = [''.join(col) for col in df1.columns]
df1.reset_index(inplace=True)
print (df1)
   id  page.A1  page.B1  page.C1  page.A2  page.B2  page.C2
0   1        0        1        0        1        0        0
1   2        0        0        1        1        0        0
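The question also mentions capping the number of events per id. Here is a rough sketch of how the cumcount approach could be extended for that, under a few assumptions that are mine and not from the question: MAX_EVENTS is a hypothetical constant, the first events per id are the ones kept, and ids with fewer events are padded with zeros through the fill_value argument of unstack so the width stays fixed.

```python
import pandas as pd

MAX_EVENTS = 2  # hypothetical cap on events per id (the question uses 2 as an example)

# Made-up sample: id 002 has a single event, id 003 has three
df = pd.DataFrame({
    'id': ['001', '001', '002', '003', '003', '003'],
    'page.A': [0, 1, 0, 1, 0, 0],
    'page.B': [1, 0, 0, 0, 1, 0],
    'page.C': [0, 0, 1, 0, 0, 1],
})

# 1-based position of each event within its id, in the existing row order
df['g'] = df.groupby('id').cumcount() + 1

# Assumption: keep the first MAX_EVENTS events of each id and drop the rest
df = df[df['g'] <= MAX_EVENTS]

# Missing positions (id 002 has no second event) become 0 instead of NaN
wide = df.set_index(['id', 'g']).unstack(fill_value=0)
wide = wide.sort_index(axis=1, level=1)
wide.columns = ['{}{}'.format(name, pos) for name, pos in wide.columns]
wide = wide.reset_index()
print(wide)
```

If the most recent events matter more than the earliest ones, the filter would need to count from the end of each group instead; which one is right depends on the model.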