A easier way has updated in the end of the question.
I have a user-user correlation matrix called matrixcorr_of_user
like the one below:
userId 316 320 359 370 910
userId
316 1.000000 0.202133 0.208618 0.176050 0.174035
320 0.202133 1.000000 0.242837 0.019035 0.031737
359 0.208618 0.242837 1.000000 0.357620 0.175914
370 0.176050 0.019035 0.357620 1.000000 0.317371
910 0.174035 0.031737 0.175914 0.317371 1.000000
For every user, I just want to keep the 2 other users that are the most similar to him (the highest correlation values per row after excluding the elements of the diagonal). Like so:
Out[40]:
userId 316 320 359 370 910
corr_user
316 NaN 0.202133 0.208618 NaN NaN
320 0.202133 NaN 0.242837 NaN NaN
359 NaN 0.242837 NaN 0.357620 NaN
370 NaN NaN 0.357620 NaN 0.317371
910 NaN NaN 0.175914 0.317371 NaN
I know how to achieve it, but the way I came up with is too complicated. Could anyone provide a better idea?
I first melt
the matrix:
melted_corr = corr_of_user.reset_index().melt(id_vars ="userId",var_name="corr_user")
melted_corr.head()
Out[23]:
userId corr_user value
0 316 316 1.000000
1 320 316 0.202133
2 359 316 0.208618
3 370 316 0.176050
4 910 316 0.174035
filter
it row by row:
get_secend_third = lambda x : x.sort_values(ascending =False).iloc[1:3]
filted= melted_corr.set_index("userId").groupby("corr_user")["value"].apply(get_secend_third)
filted
Out[39]:
corr_user userId
316 359 0.208618
320 0.202133
320 359 0.242837
316 0.202133
359 370 0.357620
320 0.242837
370 359 0.357620
910 0.317371
910 370 0.317371
359 0.175914
and finally reshape
it:
filted.reset_index().pivot_table("value","corr_user","userId")
Out[40]:
userId 316 320 359 370 910
corr_user
316 NaN 0.202133 0.208618 NaN NaN
320 0.202133 NaN 0.242837 NaN NaN
359 NaN 0.242837 NaN 0.357620 NaN
370 NaN NaN 0.357620 NaN 0.317371
910 NaN NaN 0.175914 0.317371 NaN
I came up with a easier way to do this after saw the answer of @John Zwinck
Let's say there is a new matrix df
with some dupicated value and NaN
userId 316 320 359 370 910
userId
316 1.0 0.500000 0.500000 0.500000 NaN
320 0.5 1.000000 0.242837 0.019035 0.031737
359 0.5 0.242837 1.000000 0.357620 0.175914
370 0.5 0.019035 0.357620 1.000000 0.317371
910 NaN 0.031737 0.175914 0.317371 1.000000
At first I get the rank
of each row.
rank = df.rank(1, ascending=False, method="first")
Then I use df.isin()
to get the mask that I want.
mask = rank.isin(list(range(2,4)))
Finally
df.where(mask)
Then I get want I want.
userId 316 320 359 370 910
userId
316 NaN 0.5 0.500000 NaN NaN
320 0.5 NaN 0.242837 NaN NaN
359 0.5 NaN NaN 0.357620 NaN
370 0.5 NaN 0.357620 NaN NaN
910 NaN NaN 0.175914 0.317371 NaN
First, use np.argsort()
to find which locations have the highest values:
sort = np.argsort(df)
This gives a DataFrame whose column names are meaningless, but the second and third columns from the right contain the desired indices within each row:
316 320 359 370 910
userId
316 4 3 1 2 0
320 3 4 0 2 1
359 4 0 1 3 2
370 1 0 4 2 3
910 1 0 2 3 4
Next, construct a boolean mask, set to true in the above locations:
mask = np.zeros(df.shape, bool)
rows = np.arange(len(df))
mask[rows, sort.iloc[:,-2]] = True
mask[rows, sort.iloc[:,-3]] = True
Now you have the mask you need:
array([[False, True, True, False, False],
[ True, False, True, False, False],
[False, True, False, True, False],
[False, False, True, False, True],
[False, False, True, True, False]], dtype=bool)
Finally, df.where(mask)
:
316 320 359 370 910
userId
316 NaN 0.202133 0.208618 NaN NaN
320 0.202133 NaN 0.242837 NaN NaN
359 NaN 0.242837 NaN 0.357620 NaN
370 NaN NaN 0.357620 NaN 0.317371
910 NaN NaN 0.175914 0.317371 NaN
This should work:
melted_corr['group_rank']=melted_corr.groupby('userId')['value']\
.rank(ascending=False)
then select the top-2 per user with:
melted_corr[melted_corr.group_rank<=2]
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With