I have a dataframe like this (please ignore the first column, which is just the index):
   user_id  created_at  count
1    12136  2017-02-19      4
2    12136  2017-02-16      4
3    12136  2017-02-17      2
4    72349  2017-02-17      8
5    72349  2017-02-19      2
7    72672  2017-02-20      3
8    72672  2017-02-19      2
I want to map the user_id values to integers starting from 0:
12136 -> 0
72349 -> 1
72672 -> 2
And similarly for the created_at column (starting from the smallest value):
2017-02-16 -> 0
2017-02-17 -> 1
2017-02-19 -> 2
2017-02-20 -> 3
At the end I should have this dataframe (note that 0 values are added for dates where there is no user activity):
user_id created_at count
0 0 4
0 1 2
0 2 4
0 3 0
1 0 0
1 1 8
1 2 2
1 3 0
2 0 0
2 1 0
2 2 2
2 3 3
Also I need to obtain these lists:
label1 = [12136, 72349, 72672]
label2 = ['2017-02-16', '2017-02-17', '2017-02-19', '2017-02-20']
I wonder if there are any methods that could assist me in performing this efficiently?
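First, get your lists. Note that unique() returns values in order of appearance, so list2 below is not sorted; sort it (for example with sorted(list2)) if it needs to match the codes assigned in the next step.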
list1 = df.user_id.unique()
print(list1)
array([12136, 72349, 72672])
list2 = df.created_at.unique()
print(list2)
array(['2017-02-19', '2017-02-16', '2017-02-17', '2017-02-20'], dtype=object)
Convert the user_id and created_at columns to category codes.
df['user_id'] = df['user_id'].astype('category').cat.codes
df['created_at'] = df['created_at'].astype('category').cat.codes
print(df)
   user_id  created_at  count
1        0           2      4
2        0           0      4
3        0           1      2
4        1           1      8
5        1           2      2
7        2           3      3
8        2           2      2
Use a groupby and a reindex operation.
df = (
    df.set_index('created_at')
      .groupby('user_id', as_index=False)
      .apply(lambda x: x.reindex(df.created_at.unique()))
      .sort_index()
      .reset_index([1])
)
Clean up your columns.
df.user_id = df.groupby(level=0).user_id.transform(lambda x: x.ffill().bfill())
df['count'] = df['count'].fillna(0)
print(df.astype(int))
   created_at  user_id  count
0           0        0      4
0           1        0      2
0           2        0      4
0           3        0      0
1           0        1      0
1           1        1      8
1           2        1      2
1           3        1      0
2           0        2      0
2           1        2      0
2           2        2      2
2           3        2      3
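As an alternative to the groupby and reindex step, the missing (user, date) pairs can also be filled in by reindexing against the full Cartesian product of the codes. This is a minimal sketch, assuming the two columns are already encoded as integers as shown above; the names `encoded`, `full_index`, and `result` are illustrative and not part of the original answer.

```python
import pandas as pd

# Assumed input: the frame after user_id and created_at were converted to codes.
encoded = pd.DataFrame({
    "user_id": [0, 0, 0, 1, 1, 2, 2],
    "created_at": [2, 0, 1, 1, 2, 3, 2],
    "count": [4, 4, 2, 8, 2, 3, 2],
})

# Every (user_id, created_at) combination, in sorted order.
full_index = pd.MultiIndex.from_product(
    [sorted(encoded["user_id"].unique()), sorted(encoded["created_at"].unique())],
    names=["user_id", "created_at"],
)

# Reindex onto the full grid; pairs with no activity get a count of 0.
result = (
    encoded.set_index(["user_id", "created_at"])
    .reindex(full_index, fill_value=0)
    .reset_index()
)
print(result)
```

Passing `fill_value=0` keeps the count column as integers, so no separate fillna or astype(int) step should be needed. This sketch assumes each (user_id, created_at) pair appears at most once in the input.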
You can convert the columns to categories and get a mapping dictionary for each:
df['user_id']= df['user_id'].astype('category')
label1 = dict(enumerate(df['user_id'].cat.categories))
df['created_at']= df['created_at'].astype('category')
label2 = dict(enumerate(df['created_at'].cat.categories))
Now convert the column values to category codes:
df[['user_id', 'created_at']] = df[['user_id', 'created_at']].apply(lambda x: x.cat.codes)
You get:
   user_id  created_at  count
1        0           2      4
2        0           0      4
3        0           1      2
4        1           1      8
5        1           2      2
7        2           3      3
8        2           2      2
label1
{0: 12136, 1: 72349, 2: 72672}
label2
{0: '2017-02-16', 1: '2017-02-17', 2: '2017-02-19', 3: '2017-02-20'}
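If plain lists are wanted rather than dictionaries, as in the question, `pd.factorize` with `sort=True` returns the integer codes and the sorted unique labels in one call. The following is a minimal sketch on the sample data from the question; the variable names `user_codes`, `date_codes`, and `encoded` are illustrative.

```python
import pandas as pd

# Sample data from the question (index column omitted).
df = pd.DataFrame({
    "user_id": [12136, 12136, 12136, 72349, 72349, 72672, 72672],
    "created_at": ["2017-02-19", "2017-02-16", "2017-02-17", "2017-02-17",
                   "2017-02-19", "2017-02-20", "2017-02-19"],
    "count": [4, 4, 2, 8, 2, 3, 2],
})

# sort=True assigns codes in sorted order of the labels,
# so the smallest user_id and the earliest date both map to 0.
user_codes, user_labels = pd.factorize(df["user_id"], sort=True)
date_codes, date_labels = pd.factorize(df["created_at"], sort=True)

label1 = user_labels.tolist()
label2 = date_labels.tolist()

# Replace the original values with their integer codes.
encoded = df.assign(user_id=user_codes, created_at=date_codes)
print(label1)
print(label2)
print(encoded)
```

The dates here are ISO-formatted strings, so lexicographic sorting matches chronological order; with other date formats it would likely be safer to convert the column with `pd.to_datetime` first.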