 

How to use scikit-learn's DictVectorizer to get an encoded dataframe from a dense dataframe in Python?

I have a dataframe as follows:

   user  item  affinity
0     1    13       0.1
1     2    11       0.4
2     3    14       0.9
3     4    12       1.0

From this I want to create an encoded dataset (for fastFM) as follows:

  user1 user2 user3 user4 item11 item12 item13 item14 affinity
    1     0     0     0     0      0      1      0       0.1
    0     1     0     0     1      0      0      0       0.4
    0     0     1     0     0      0      0      1       0.9
    0     0     0     1     0      1      0      0       1.0

Do I need a dictvectorizer from sklearn? If yes, then is there a way to convert original dataframe to dictionary which can be given to dictvectorizer which will in turn give me the encoded dataset as shown?

asked by exAres

1 Answer

You can use get_dummies with concat. If the values in the user or item columns are numeric, cast them to string with astype:

import pandas as pd

df = pd.DataFrame({'item': {0: 13, 1: 11, 2: 14, 3: 12},
                   'affinity': {0: 0.1, 1: 0.4, 2: 0.9, 3: 1.0},
                   'user': {0: 1, 1: 2, 2: 3, 3: 4}},
                  columns=['user','item','affinity'])
print(df)
   user  item  affinity
0     1    13       0.1
1     2    11       0.4
2     3    14       0.9
3     4    12       1.0

df1 = df.user.astype(str).str.get_dummies()
df1.columns = ['user' + str(x) for x in df1.columns]
print(df1)
   user1  user2  user3  user4
0      1      0      0      0
1      0      1      0      0
2      0      0      1      0
3      0      0      0      1

df2 = df.item.astype(str).str.get_dummies()
df2.columns = ['item' + str(x) for x in df2.columns]
print(df2)
   item11  item12  item13  item14
0       0       0       1       0
1       1       0       0       0
2       0       0       0       1
3       0       1       0       0

print(pd.concat([df1, df2, df.affinity], axis=1))
   user1  user2  user3  user4  item11  item12  item13  item14  affinity
0      1      0      0      0       0       0       1       0       0.1
1      0      1      0      0       1       0       0       0       0.4
2      0      0      1      0       0       0       0       1       0.9
3      0      0      0      1       0       1       0       0       1.0
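
In newer pandas you can get the same result in a single call: pd.get_dummies accepts columns, prefix and prefix_sep arguments, encodes only the listed columns, and keeps affinity untouched. A minimal sketch, assuming a pandas version that supports those arguments:

encoded = pd.get_dummies(df, columns=['user', 'item'],
                         prefix=['user', 'item'], prefix_sep='')
# non-encoded columns come first in the result, so move affinity to the end
encoded = encoded[[c for c in encoded.columns if c != 'affinity'] + ['affinity']]
print(encoded)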

Timings:

len(df) = 4:

In [49]: %timeit pd.concat([df1,df2, df.affinity], axis=1)
The slowest run took 4.91 times longer than the fastest. This could mean that an intermediate result is being cached 
1000 loops, best of 3: 690 µs per loop

len(df) = 40:

df = pd.concat([df]*10).reset_index(drop=True)

In [51]: %timeit pd.concat([df1,df2, df.affinity], axis=1)
The slowest run took 5.56 times longer than the fastest. This could mean that an intermediate result is being cached 
1000 loops, best of 3: 719 µs per loop

len(df) = 400:

df = pd.concat([df]*100).reset_index(drop=True)

In [43]: %timeit pd.concat([df1,df2, df.affinity], axis=1)
The slowest run took 4.55 times longer than the fastest. This could mean that an intermediate result is being cached 
1000 loops, best of 3: 748 µs per loop

len(df) = 4k:

df = pd.concat([df]*1000).reset_index(drop=True)

In [41]: %timeit pd.concat([df1,df2, df.affinity], axis=1)
The slowest run took 4.67 times longer than the fastest. This could mean that an intermediate result is being cached 
1000 loops, best of 3: 761 µs per loop

len(df) = 40k:

df = pd.concat([df]*10000).reset_index(drop=True)

%timeit pd.concat([df1,df2, df.affinity], axis=1)
1000 loops, best of 3: 1.83 ms per loop

len(df) = 400k:

df = pd.concat([df]*100000).reset_index(drop=True)

%timeit pd.concat([df1,df2, df.affinity], axis=1)
100 loops, best of 3: 15.6 ms per loop
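
And to answer the DictVectorizer part of the question: yes, you can convert the original four-row df to a list of per-row dicts with to_dict('records') and feed that to DictVectorizer. A minimal sketch; note that the feature names come out as 'user=1', 'item=13' and so on rather than 'user1', and that the column-name accessor differs by scikit-learn version:

from sklearn.feature_extraction import DictVectorizer

# cast the id columns to string so DictVectorizer one-hot encodes them;
# numeric values such as affinity pass through unchanged
records = df.assign(user=df.user.astype(str),
                    item=df.item.astype(str)).to_dict('records')

vec = DictVectorizer(sparse=True)   # fastFM expects a scipy sparse matrix
X = vec.fit_transform(records)
print(vec.get_feature_names_out())  # get_feature_names() on older scikit-learn

For fastFM you would typically leave affinity out of the dicts and pass it separately as the target vector y.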
answered by jezrael


