For a recommendation service I am training a matrix factorization model (LightFM) on a set of user-item interactions. For the model to yield the best results, I need to map my user and item IDs to a continuous range of integer IDs starting at 0.
I'm using a pandas DataFrame in the process, and I have found a MultiIndex to be extremely convenient to create this mapping, like so:
import pandas as pd

ratings = [{'user_id': 1, 'item_id': 1, 'rating': 1.0},
           {'user_id': 1, 'item_id': 3, 'rating': 1.0},
           {'user_id': 3, 'item_id': 1, 'rating': 1.0},
           {'user_id': 3, 'item_id': 3, 'rating': 1.0}]
df = pd.DataFrame(ratings, columns=['user_id', 'item_id', 'rating'])
df = df.set_index(['user_id', 'item_id'])
df
Out:
rating
user_id item_id
1 1 1.0
1 3 1.0
3 1 1.0
3 3 1.0
This then allows me to get the continuous mappings like so:
df.index.labels[0] # For users
Out:
FrozenNDArray([0, 0, 1, 1], dtype='int8')
df.index.labels[1] # For items
Out:
FrozenNDArray([0, 1, 0, 1], dtype='int8')
Afterwards, I can map them back using the df.index.levels[0].get_loc method. Great!
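For example, a quick sketch of that reverse lookup on the data above (just illustrating the get_loc call, nothing beyond what is already shown):
df.index.levels[0].get_loc(3)  # user_id 3 -> integer position 1
df.index.levels[1].get_loc(1)  # item_id 1 -> integer position 0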
But, now I'm trying to streamline my model training process, ideally by training it incrementally on new data, preserving the old ID mappings. Something like:
new_ratings = [{'user_id': 2, 'item_id': 1, 'rating': 1.0},
               {'user_id': 2, 'item_id': 2, 'rating': 1.0}]
df2 = pd.DataFrame(new_ratings, columns=['user_id', 'item_id', 'rating'])
df2 = df2.set_index(['user_id', 'item_id'])
df2
Out:
rating
user_id item_id
2 1 1.0
2 2 1.0
Then I simply append the new ratings to the old DataFrame:
df3 = df.append(df2)
df3
Out:
rating
user_id item_id
1 1 1.0
1 3 1.0
3 1 1.0
3 3 1.0
2 1 1.0
2 2 1.0
Looks good, but
df3.index.labels[0] # For users
Out:
FrozenNDArray([0, 0, 2, 2, 1, 1], dtype='int8')
df3.index.labels[1] # For items
Out:
FrozenNDArray([0, 2, 0, 2, 0, 1], dtype='int8')
I added user_id=2 and item_id=2 in the later DataFrame on purpose, to illustrate where it goes wrong for me. In df3, the label for 3 (for both user and item) has moved from integer position 1 to 2, so the mapping is no longer the same. What I'm looking for is [0, 0, 1, 1, 2, 2] and [0, 1, 0, 1, 0, 2] for the user and item mappings respectively.
This is probably because of ordering in pandas Index objects, and I'm unsure whether what I want is at all possible using a MultiIndex strategy. Looking for help on how to most effectively tackle this problem :)
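For what it's worth, inspecting the levels of the combined index is consistent with that suspicion: the levels come out sorted, so 2 now sits between 1 and 3 and bumps 3 from position 1 to 2 (a quick check derived from the labels shown above):
df3.index.levels[0]  # [1, 2, 3] - user_id 2 slots in before 3
df3.index.levels[1]  # [1, 2, 3] - same shift on the item side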
Some notes:
I've made a modification to @jpp's answer to satisfy the additional requirement I've added later (tagged with EDIT). This also truly satisfies the original question as posed in the title, since it preserves the old index integer positions, regardless of rows being reordered for whatever reason. I've also wrapped things into functions:
import pandas as pd
from itertools import chain
from toolz import unique

def expand_index(source, target, index_cols=['user_id', 'item_id']):
    # Elevate index to series, keeping source with index
    temp = source.reset_index()
    target = target.reset_index()
    # Convert columns to categorical, using the source index and target columns
    for col in index_cols:
        i = source.index.names.index(col)
        col_cats = list(unique(chain(source.index.levels[i], target[col])))
        temp[col] = pd.Categorical(temp[col], categories=col_cats)
        target[col] = pd.Categorical(target[col], categories=col_cats)
    # Convert series back to index
    source = temp.set_index(index_cols)
    target = target.set_index(index_cols)
    return source, target

def concat_expand_index(old, new):
    old, new = expand_index(old, new)
    return pd.concat([old, new])
df3 = concat_expand_index(df, df2)
The result:
df3.index.labels[0] # For users
Out:
FrozenNDArray([0, 0, 1, 1, 2, 2], dtype='int8')
df3.index.labels[1] # For items
Out:
FrozenNDArray([0, 1, 0, 1, 0, 2], dtype='int8')
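As a follow-up of my own (not part of @jpp's approach): once the codes are stable, they can be packed straight into the kind of sparse interaction matrix LightFM consumes. The scipy wiring below is a hedged sketch of how that might look; note that index.labels was renamed to index.codes in pandas >= 0.24.
import numpy as np
from scipy.sparse import coo_matrix

# Stable integer codes taken directly from the MultiIndex
user_codes = np.asarray(df3.index.labels[0])   # [0, 0, 1, 1, 2, 2]
item_codes = np.asarray(df3.index.labels[1])   # [0, 1, 0, 1, 0, 2]

# One row per mapped user, one column per mapped item
interactions = coo_matrix(
    (df3['rating'].values, (user_codes, item_codes)),
    shape=(user_codes.max() + 1, item_codes.max() + 1),
)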
I think the use of MultiIndex overcomplicates this objective:
I need to map my user and item IDs to a continuous range of integer IDs starting at 0.
This solution falls into the below category:
Alternatives without MultiIndex are completely acceptable.
import numpy as np

def add_mapping(df, df2, df3, column_name='user_id'):
    # Note: assumes user_id/item_id are regular columns here
    # (i.e. before set_index / after reset_index), not the index.
    initial = df.loc[:, column_name].unique()
    new = df2.loc[~df2.loc[:, column_name].isin(initial), column_name].unique()
    # Initial mapping for the IDs already seen in df
    maps = np.arange(len(initial))
    mapping = dict(zip(initial, maps))
    # Extend the mapping with the IDs that only appear in df2
    maps = np.append(maps, np.arange(np.max(maps)+1, np.max(maps)+1+len(new)))
    total = np.append(initial, new)
    mapping = dict(zip(total, maps))
    df3[column_name+'_map'] = df3.loc[:, column_name].map(mapping)
    return df3
add_mapping(df, df2, df3, column_name='item_id')
add_mapping(df, df2, df3, column_name='user_id')
user_id item_id rating item_id_map user_id_map
0 1 1 1.0 0 0
1 1 3 1.0 1 0
2 3 1 1.0 0 1
3 3 3 1.0 1 1
0 2 1 1.0 0 2
1 2 2 1.0 2 2
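Reading off the two map columns, this reproduces exactly the codes asked for in the question (a quick check against the table above):
df3['user_id_map'].values  # array([0, 0, 1, 1, 2, 2])
df3['item_id_map'].values  # array([0, 1, 0, 1, 0, 2])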
This is how to maintain a mapping for the user_id values. The same holds for the item_id values as well.
These are the initial user_id values (unique):
initial_users = df['user_id'].unique()
# initial_users = array([1, 3])
user_map maintains a mapping for the user_id values, as per your requirement:
user_id_maps = np.arange(len(initial_users))
# user_id_maps = array([0, 1])
user_map = dict(zip(initial_users, user_id_maps))
# user_map = {1: 0, 3: 1}
These are the new user_id values you got from df2, ones that you didn't see in df:
new_users = df2[~df2['user_id'].isin(initial_users)]['user_id'].unique()
# new_users = array([2])
Now we update user_map for the total user base with the new users:
user_id_maps = np.append(user_id_maps, np.arange(np.max(user_id_maps)+1, np.max(user_id_maps)+1+len(new_users)))
# array([0, 1, 2])
total_users = np.append(initial_users, new_users)
# array([1, 3, 2])
user_map = dict(zip(total_users, user_id_maps))
# user_map = {1: 0, 2: 2, 3: 1}
Then, just map the values from user_map onto df3['user_id']:
df3['user_map'] = df3['user_id'].map(user_map)
user_id item_id rating user_map
0 1 1 1.0 0
1 1 3 1.0 0
2 3 1 1.0 1
3 3 3 1.0 1
0 2 1 1.0 2
1 2 2 1.0 2
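If you keep receiving batches, the same idea generalizes to a small helper that only ever appends new IDs to an existing mapping. This is a sketch of my own on top of this answer (extend_mapping is a hypothetical name, not a pandas or LightFM API, and it assumes user_id is a regular column as above):
def extend_mapping(mapping, ids):
    # Assign the next free integer to every previously unseen ID,
    # leaving existing assignments untouched.
    next_code = len(mapping)
    for _id in ids:
        if _id not in mapping:
            mapping[_id] = next_code
            next_code += 1
    return mapping

user_map = extend_mapping({}, df['user_id'])          # {1: 0, 3: 1}
user_map = extend_mapping(user_map, df2['user_id'])   # {1: 0, 3: 1, 2: 2}
df3['user_map'] = df3['user_id'].map(user_map)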