I am utilizing pandas to create a dataframe that appears as follows:
ratings = pandas.DataFrame({
'article_a':[1,1,0,0],
'article_b':[1,0,0,0],
'article_c':[1,0,0,0],
'article_d':[0,0,0,1],
'article_e':[0,0,0,1]
},index=['Alice','Bob','Carol','Dave'])
I would like to compute another matrix from this input one that will compare each row against all other rows. Let's assume for example the computation was a function to find the length of the intersection set, I'd like an output DataFrame with the len(intersection(Alice,Bob))
, len(intersection(Alice,Carol))
, len(intersection(Alice,Dave))
in the first row, with each row following that format against the others. Using this example input, the output matrix would be 4x3:
len(intersection(Alice,Bob)),len(intersection(Alice,Carol)),len(intersection(Alice,Dave))
len(intersection(Bob,Alice)),len(intersection(Bob,Carol)),len(intersection(Bob,Dave))
len(intersection(Carol,Alice)),len(intersection(Carol,Bob)),len(intersection(Carol,Dave))
len(intersection(Dave,Alice)),len(intersection(Dave,Bob)),len(intersection(Dave,Carol))
Is there a named method for this kind of function based computation in pandas? What would be the most efficient way to accomplish this?
Use apply() function when you wanted to update every row in pandas DataFrame by calling a custom function. In order to apply a function to every row, you should use axis=1 param to apply(). By applying a function to each row, we can create a new column by using the values from the row, updating the row e.t.c.
The results show that apply massively outperforms iterrows . As mentioned previously, this is because apply is optimized for looping through dataframe rows much quicker than iterrows does. While slower than apply , itertuples is quicker than iterrows , so if looping is required, try implementing itertuples instead.
I am not aware of a named method, but I have a one-liner.
In [21]: ratings.apply(lambda row: ratings.apply(
... lambda x: np.equal(row, x), 1).sum(1), 1)
Out[21]:
Alice Bob Carol Dave
Alice 5 3 2 0
Bob 3 5 4 2
Carol 2 4 5 3
Dave 0 2 3 5
@Dan Allan solution is 'right', here's a slightly different way of approaching the problem
In [26]: ratings
Out[26]:
article_a article_b article_c article_d article_e
Alice 1 1 1 0 0
Bob 1 0 0 0 0
Carol 0 0 0 0 0
Dave 0 0 0 1 1
In [27]: ratings.apply(lambda x: (ratings.T.sub(x,'index')).sum(),1)
Out[27]:
Alice Bob Carol Dave
Alice 0 -2 -3 -1
Bob 2 0 -1 1
Carol 3 1 0 2
Dave 1 -1 -2 0
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With