Appending pandas DataFrame with MultiIndex with data containing new labels, but preserving the integer positions of the old MultiIndex

Base scenario

For a recommendation service I am training a matrix factorization model (LightFM) on a set of user-item interactions. For the matrix factorization model to yield the best results, I need to map my user and item IDs to a continuous range of integer IDs starting at 0.

I'm using a pandas DataFrame in the process, and I have found a MultiIndex to be extremely convenient to create this mapping, like so:

import pandas as pd

ratings = [{'user_id': 1, 'item_id': 1, 'rating': 1.0},
           {'user_id': 1, 'item_id': 3, 'rating': 1.0},
           {'user_id': 3, 'item_id': 1, 'rating': 1.0},
           {'user_id': 3, 'item_id': 3, 'rating': 1.0}]

df = pd.DataFrame(ratings, columns=['user_id', 'item_id', 'rating'])
df = df.set_index(['user_id', 'item_id'])
df

Out:
                 rating
user_id item_id 
1       1        1.0
1       3        1.0
3       1        1.0
3       3        1.0

This then lets me get the continuous maps like so:

df.index.labels[0]    # For users

Out:
FrozenNDArray([0, 0, 1, 1], dtype='int8')

df.index.labels[1]    # For items

Out:
FrozenNDArray([0, 1, 0, 1], dtype='int8')

Afterwards, I can look up the integer for a given ID with the df.index.levels[0].get_loc method, and go the other way by indexing into df.index.levels[0]. Great!
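For readers on a newer pandas: MultiIndex.labels was renamed to MultiIndex.codes in pandas 0.24. A minimal sketch of the same round trip under that assumption (the names user_levels, user_codes, code_for_3 and id_for_1 are illustrative only and not part of the original post):

```python
import pandas as pd

ratings = [{'user_id': 1, 'item_id': 1, 'rating': 1.0},
           {'user_id': 1, 'item_id': 3, 'rating': 1.0},
           {'user_id': 3, 'item_id': 1, 'rating': 1.0},
           {'user_id': 3, 'item_id': 3, 'rating': 1.0}]
df = pd.DataFrame(ratings).set_index(['user_id', 'item_id'])

user_levels = df.index.levels[0]   # unique user IDs held by the index
user_codes = df.index.codes[0]     # per-row integer position into user_levels

# Original ID -> continuous integer
code_for_3 = user_levels.get_loc(3)

# Continuous integer -> original ID
id_for_1 = user_levels[1]
```

Here user_codes holds [0, 0, 1, 1], code_for_3 is 1 and id_for_1 is 3, matching the outputs shown above.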

Extension

But, now I'm trying to streamline my model training process, ideally by training it incrementally on new data, preserving the old ID mappings. Something like:

new_ratings = [{'user_id': 2, 'item_id': 1, 'rating': 1.0},
               {'user_id': 2, 'item_id': 2, 'rating': 1.0}]

df2 = pd.DataFrame(new_ratings, columns=['user_id', 'item_id', 'rating'])
df2 = df2.set_index(['user_id', 'item_id'])
df2

Out:
                 rating
user_id item_id 
2       1        1.0
2       2        1.0

Then, simply appending the new ratings to the old DataFrame

df3 = df.append(df2)
df3

Out:
                 rating
user_id item_id 
1       1        1.0
1       3        1.0
3       1        1.0
3       3        1.0
2       1        1.0
2       2        1.0

Looks good, but

df3.index.labels[0]    # For users

Out:
FrozenNDArray([0, 0, 2, 2, 1, 1], dtype='int8')

df3.index.labels[1]    # For items

Out:
FrozenNDArray([0, 2, 0, 2, 0, 1], dtype='int8')

I added user_id=2 and item_id=2 in the later DataFrame on purpose, to illustrate where it goes wrong for me. In df3, label 3 (for both user and item) has moved from integer position 1 to 2, so the mapping is no longer the same. What I'm looking for is [0, 0, 1, 1, 2, 2] and [0, 1, 0, 1, 0, 2] for the user and item mappings respectively.

This is probably because of how pandas orders the levels of an index, and I'm unsure whether what I want is possible at all using a MultiIndex strategy. I'm looking for help on how to most effectively tackle this problem :)
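The shift can be inspected directly by comparing the index levels before and after combining. This is only a sketch: it uses pd.concat instead of DataFrame.append (which was removed in pandas 2.0) and MultiIndex.codes instead of labels, and the names old, new and combined are made up for the illustration.

```python
import pandas as pd

index_cols = ['user_id', 'item_id']
old = pd.DataFrame({'user_id': [1, 1, 3, 3],
                    'item_id': [1, 3, 1, 3],
                    'rating': 1.0}).set_index(index_cols)
new = pd.DataFrame({'user_id': [2, 2],
                    'item_id': [1, 2],
                    'rating': 1.0}).set_index(index_cols)

combined = pd.concat([old, new])

# Levels of the old index: user 3 sits at position 1
old_user_levels = list(old.index.levels[0])

# Levels after combining: the new user 2 is slotted in before 3,
# so every row for user 3 now gets code 2 instead of 1
combined_user_levels = list(combined.index.levels[0])
combined_user_codes = list(combined.index.codes[0])
```

The combined levels come out as [1, 2, 3], which is consistent with the [0, 0, 2, 2, 1, 1] output reported above: the codes are positions into the rebuilt, sorted levels rather than into the old ones.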

Some notes:

I find using DataFrames convenient for several reasons, but I use the MultiIndex purely for the ID mappings. Alternatives without MultiIndex are completely acceptable.
I cannot guarantee that new user_id and item_id entries in new ratings are larger than any values in the old dataset, hence my example of adding id 2 when [1, 3] were present.
For my incremental training approach, I will need to store my ID maps somewhere. If I only load new ratings partially, I will have to store the old DataFrame and ID maps somewhere. Would be great if it could all be in one place, like it would be with an index, but columns work too.
EDIT: An additional requirement is to allow for row re-ordering of the original DataFrame, as might happen when duplicate ratings exist, and I want to keep the most recent one.
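To make the EDIT requirement concrete, here is a small sketch of the kind of re-ordering meant. It assumes that later rows are the more recent ones; the names old, new, combined and latest are illustrative only.

```python
import pandas as pd

index_cols = ['user_id', 'item_id']
old = pd.DataFrame({'user_id': [1, 3],
                    'item_id': [1, 1],
                    'rating': [1.0, 1.0]}).set_index(index_cols)
# User 1 rates item 1 again, with a different value
new = pd.DataFrame({'user_id': [1],
                    'item_id': [1],
                    'rating': [5.0]}).set_index(index_cols)

combined = pd.concat([old, new])

# Keep only the last occurrence of each (user_id, item_id) pair
latest = combined[~combined.index.duplicated(keep='last')]
```

After this step the row for (1, 1) sits after the row for (3, 1), so the row order no longer matches the original DataFrame, and the ID mapping must not depend on it.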

Solution (credits to @jpp for original)

I've modified @jpp's answer to satisfy the additional requirement I added later (tagged with EDIT). This also truly satisfies the original question as posed in the title, since it preserves the old index integer positions regardless of rows being reordered for whatever reason. I've also wrapped things into functions:

from itertools import chain

import pandas as pd
from toolz import unique


def expand_index(source, target, index_cols=['user_id', 'item_id']):

    # Elevate index to series, keeping source with index
    temp = source.reset_index()
    target = target.reset_index()

    # Convert columns to categorical, using the source index and target columns
    for col in index_cols:
        i = source.index.names.index(col)
        col_cats = list(unique(chain(source.index.levels[i], target[col])))

        temp[col] = pd.Categorical(temp[col], categories=col_cats)
        target[col] = pd.Categorical(target[col], categories=col_cats)

    # Convert series back to index
    source = temp.set_index(index_cols)
    target = target.set_index(index_cols)

    return source, target


def concat_expand_index(old, new):
    old, new = expand_index(old, new)
    return pd.concat([old, new])


df3 = concat_expand_index(df, df2)

The result:

df3.index.labels[0]    # For users

Out:
FrozenNDArray([0, 0, 1, 1, 2, 2], dtype='int8')

df3.index.labels[1]    # For items

Out:
FrozenNDArray([0, 1, 0, 1, 0, 2], dtype='int8')
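The mechanism the functions above lean on is that a pd.Categorical built with an explicit categories list takes its integer codes from the positions in that list, not from the sorted values. A minimal sketch of just that piece (the variable names are illustrative):

```python
import pandas as pd

# Old IDs first (in their existing order), new IDs appended at the end
categories = [1, 3, 2]

users = pd.Categorical([1, 1, 3, 3, 2, 2], categories=categories)
codes = users.codes.tolist()

# Re-ordering the rows does not change which code an ID receives
reordered = pd.Categorical([2, 3, 1], categories=categories)
reordered_codes = reordered.codes.tolist()
```

Here codes is [0, 0, 1, 1, 2, 2] and reordered_codes is [2, 1, 0]: ID 3 keeps code 1 even though 2 is smaller, because only its position in categories matters.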

asked May 19 '18 23:05

Fulco

1 Answer

I think the use of MultiIndex overcomplicates this objective:

I need to map my user and item IDs to a continuous range of integer IDs starting at 0.

This solution falls into the category below:

Alternatives without MultiIndex are completely acceptable.

import numpy as np
import pandas as pd

# Assumes user_id and item_id are ordinary columns, so undo set_index first
df, df2 = df.reset_index(), df2.reset_index()
df3 = pd.concat([df, df2])

def add_mapping(df, df2, df3, column_name='user_id'):

    # IDs already seen keep positions 0..n-1, in order of first appearance
    initial = df.loc[:, column_name].unique()
    # IDs that appear only in the new data get the next free integers
    new = df2.loc[~df2.loc[:, column_name].isin(initial), column_name].unique()
    total = np.append(initial, new)
    mapping = dict(zip(total, np.arange(len(total))))

    df3[column_name+'_map'] = df3.loc[:, column_name].map(mapping)

    return df3

add_mapping(df, df2, df3, column_name='item_id')
add_mapping(df, df2, df3, column_name='user_id')

   user_id  item_id  rating  item_id_map  user_id_map
0        1        1     1.0            0            0
1        1        3     1.0            1            0
2        3        1     1.0            0            1
3        3        3     1.0            1            1
0        2        1     1.0            0            2
1        2        2     1.0            2            2

Explanation

This is how to maintain a mapping for the user_id values. Same holds for the item_id values as well.

These are the initial user_id values (unique):

initial_users = df['user_id'].unique()
# initial_users = array([1, 3])

user_map maintains a mapping for user_id values, as per your requirement:

user_id_maps = np.arange(len(initial_users))
# user_id_maps = array([0, 1])

user_map = dict(zip(initial_users, user_id_maps))
# user_map = {1: 0, 3: 1}

These are the new user_id values you got from df2 - ones that you didn't see in df:

new_users = df2[~df2['user_id'].isin(initial_users)]['user_id'].unique()
# new_users = array([2])

Now we update user_map for the total user base with the new users:

user_id_maps = np.append(user_id_maps, np.arange(np.max(user_id_maps)+1, np.max(user_id_maps)+1+len(new_users)))
# array([0, 1, 2])
total_users = np.append(initial_users, new_users)
# array([1, 3, 2])

user_map = dict(zip(total_users, user_id_maps))
# user_map = {1: 0, 2: 2, 3: 1}

Then, just map df3['user_id'] through user_map:

df3['user_map'] = df3['user_id'].map(user_map)

   user_id  item_id  rating  user_map
0        1        1     1.0         0
1        1        3     1.0         0
2        3        1     1.0         1
3        3        3     1.0         1
0        2        1     1.0         2
1        2        2     1.0         2
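The steps above handle one old batch and one new batch. For repeated incremental updates the same idea can be folded into a helper that extends a stored mapping. This is only a sketch: extend_mapping is a hypothetical name, and it assumes the existing mapping values are exactly 0..n-1 so that len() gives the next free integer.

```python
import pandas as pd


def extend_mapping(mapping, ids):
    """Return a copy of mapping where unseen ids get the next free integers.

    Assumes the values already in mapping are exactly 0..len(mapping)-1.
    """
    extended = dict(mapping)
    for value in ids:
        if value not in extended:
            extended[value] = len(extended)
    return extended


# First batch: users 1 and 3
user_map = extend_mapping({}, [1, 1, 3, 3])
# Later batch: user 2 is new, user 3 is already known and keeps its integer
user_map = extend_mapping(user_map, [2, 2, 3])

# Row order does not matter when applying the mapping
mapped = pd.Series([3, 2, 1]).map(user_map).tolist()
```

With the example data user_map ends up as {1: 0, 3: 1, 2: 2} and mapped is [1, 2, 0]. The dict is small and could be stored alongside the ratings between training runs.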

answered Oct 13 '22 00:10

akilat90

Tags:

python

pandas

numpy

categorical-data

recommendation-engine