pandas matrix calculation till the diagonal

Question

I'm doing a matrix calculation using pandas in Python.

My raw data is a list of strings per row (the values within each list are unique).

id     list_of_value
0      ['a','b','c']
1      ['d','b','c']
2      ['a','b','c']
3      ['a','b','c']

I have to calculate a score for each row against all the other rows.

score calculation algorithm:

Step 1: Take value of id 0: ['a','b','c'],
Step 2: find the intersection between id 0 and id 1 , 
        resultant = ['b','c']
Step 3: Score Calculation => resultant.size / id(0).size

Repeat steps 2 and 3 between id 0 and ids 1, 2, 3, and similarly for all the other ids.
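As a minimal sketch of steps 1 to 3 for a single pair of rows (the helper name `score` is only illustrative and not part of the original post):

```python
def score(reference, other):
    """Share of the distinct values in `reference` that also occur in `other`."""
    reference_set = set(reference)
    # step 2: intersection, step 3: its size divided by the size of the reference
    return len(reference_set & set(other)) / len(reference_set)


rows = [['a', 'b', 'c'], ['d', 'b', 'c'], ['a', 'b', 'c'], ['a', 'b', 'c']]

print(score(rows[0], rows[1]))  # id 0 vs id 1: 2 shared values out of 3
print(score(rows[0], rows[2]))  # id 0 vs id 2: identical lists
```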

Create N * N matrix:

-  0     1     2     3
0  1     0.67  1     1
1  0.67  1     0.67  0.67
2  1     0.67  1     1
3  1     0.67  1     1

At present I'm using the pandas dummies approach to calculate the score:

# note: sum(level=0) was removed in pandas 2.0; there, use .groupby(level=0).sum() instead
s = pd.get_dummies(df.list_of_value.explode()).sum(level=0)
s.dot(s.T).div(s.sum(1))
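To make the two lines easier to follow, here is a step-by-step sketch of the same idea on the sample data. It assumes a reasonably recent pandas and uses `groupby(level=0).sum()` in place of `sum(level=0)`; the intermediate names (`exploded`, `indicators`, `shared`, `scores`) are chosen only for illustration.

```python
import pandas as pd

df = pd.DataFrame({"list_of_value": [['a', 'b', 'c'], ['d', 'b', 'c'],
                                     ['a', 'b', 'c'], ['a', 'b', 'c']]})

# one row per (id, value) pair; the original id is kept as the index
exploded = df["list_of_value"].explode()

# one 0/1 column per distinct value, summed back to one row per id
indicators = pd.get_dummies(exploded).astype(int).groupby(level=0).sum()

# entry (i, j) counts the values that rows i and j have in common
shared = indicators.dot(indicators.T)

# div aligns the Series with the columns, so column j is divided by the size of list j
scores = shared.div(indicators.sum(axis=1))
print(scores)
```

Every entry of the N * N result is computed here, which is the repetition the question wants to avoid.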

But the calculation is repeated after the diagonal of the matrix; calculating the score up to the diagonal is sufficient. For example:

The score for ID 0 only needs to be calculated up to ID(row,column) (0,0); the scores for ID(row,column) (0,1), (0,2), (0,3) can be copied from ID(row,column) (1,0), (2,0), (3,0).

Detail on the calculation: in the matrix sample I need to calculate only up to the diagonal, that is, up to the yellow colored box (the diagonal of the matrix). The white values are already calculated in the green shaded area (for reference), so I just have to transpose the green shaded area into the white one.

How can I do this in pandas?

Night Train · Accepted Answer

First of all, here is a profiling of your code: first each command separately, and then the two lines as you posted them.

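e = df.list_of_value.explode()
d = pd.get_dummies(e)
s = d.sum(level=0)  # on pandas >= 2.0: d.groupby(level=0).sum()
s2 = s.dot(s.T)
s3 = s.sum(1)

%timeit df.list_of_value.explode()
%timeit pd.get_dummies(e)
%timeit d.sum(level=0)
%timeit s.dot(s.T)
%timeit s.sum(1)
%timeit s2.div(s3)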

The above profiling returned the following results:

Explode   : 1000 loops, best of 3: 201 µs per loop
Dummies   : 1000 loops, best of 3: 697 µs per loop
Sum       : 1000 loops, best of 3: 1.36 ms per loop
Dot       : 1000 loops, best of 3: 453 µs per loop
Sum2      : 10000 loops, best of 3: 162 µs per loop
Divide    : 100 loops, best of 3: 1.81 ms per loop

Running your two lines together results in:

100 loops, best of 3: 5.35 ms per loop

Using a different approach that relies less on the (sometimes expensive) functionality of pandas, the code below takes only about a third of the time. It also skips the calculation for the upper triangular matrix and for the diagonal.

import numpy as np
import pandas as pd

# create a matrix filled with ones (thus the diagonal is already filled with ones)
df2 = np.ones(shape = (len(df), len(df)))
for i in range(len(df)):
    d0 = set(df.iloc[i].list_of_value)
    d0_len = len(d0)
    # the inner loop starts at i+1 because we don't need to calculate the diagonal
    for j in range(i + 1, len(df)):
        df2[j, i] = len(d0.intersection(df.iloc[j].list_of_value)) / d0_len
# copy the lower triangular matrix to the upper triangular matrix
df2[np.mask_indices(len(df2), np.triu)] = df2.T[np.mask_indices(len(df2), np.triu)]
# create a DataFrame from the numpy array with the column names set to score<id>
df2 = pd.DataFrame(df2, columns = [f"score{i}" for i in range(len(df))])

With df given as

df = pd.DataFrame(
    [[['a','b','c']],
     [['d','b','c']],
     [['a','b','c']],
     [['a','b','c']]],
     columns = ["list_of_value"])

the profiling for this code results in a running time of only 1.68 ms:

1000 loops, best of 3: 1.68 ms per loop
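One assumption is worth making explicit: copying the lower triangle into the upper one only reproduces the score from the question if the score of i against j equals the score of j against i. That holds when all lists contain the same number of distinct values, as in the sample data. The sketch below uses a hypothetical pair of lists with different lengths to show where the two directions differ; the helper `score` is illustrative only.

```python
def score(reference, other):
    """Size of the intersection divided by the size of `reference`."""
    reference_set = set(reference)
    return len(reference_set & set(other)) / len(reference_set)


same_size = (['a', 'b', 'c'], ['d', 'b', 'c'])        # from the question
different_size = (['a', 'b'], ['a', 'b', 'c', 'd'])   # hypothetical data

# equal list sizes: both directions give the same score
print(score(*same_size), score(*reversed(same_size)))

# different list sizes: 2/2 in one direction, 2/4 in the other
print(score(*different_size), score(*reversed(different_size)))
```

If the real data contains lists of different sizes, both triangles would have to be computed, or the division by the row size applied separately for each direction.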

UPDATE

Instead of operating on the entire DataFrame, just picking the Series that is needed gives a huge speedup.

Three methods to iterate over the entries in the Series have been tested, and all of them are more or less equal regarding the performance.

%%timeit df = pd.DataFrame([[['a','b','c']], [['d','b','c']], [['a','b','c']], [['a','b','c']]], columns = ["list_of_value"])
# %%timeit df = pd.DataFrame([[random.choices(list("abcdefghijklmnopqrstuvwxyz"), k = 15)] for _ in range(100)], columns = ["list_of_value"])

# create a matrix filled with ones (thus the diagonal is already filled with ones)
df2 = np.ones(shape = (len(df), len(df)))

# get the Series from the DataFrame
dfl = df.list_of_value

for i, d0 in enumerate(dfl.values):
# for i, d0 in dfl.items():  # (iteritems() before pandas 2.0) in terms of performance about equal to the line above
# for i in range(len(dfl)): # slightly less performant than enumerate(dfl.values)
    d0 = set(d0)
    d0_len = len(d0)
    # the inner loop starts at i+1 because we don't need to calculate the diagonal
    for j in range(i + 1, len(dfl)):
        df2[j, i] = len(d0.intersection(dfl.iloc[j])) / d0_len
# copy the lower triangular matrix to the upper triangular matrix
df2[np.mask_indices(len(df2), np.triu)] = df2.T[np.mask_indices(len(df2), np.triu)]
# create a DataFrame from the numpy array with the column names set to score<id>
df2 = pd.DataFrame(df2, columns = [f"score{i}" for i in range(len(dfl))])
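The mirroring line may be easier to read on a tiny array. The sketch below uses `np.triu_indices` with `k=1`, which selects only the positions strictly above the diagonal; this is a swapped-in alternative to the `np.mask_indices(len(df2), np.triu)` call used above (which also includes the diagonal), and the array values are made up for illustration.

```python
import numpy as np

m = np.ones((3, 3))
m[1, 0], m[2, 0], m[2, 1] = 0.1, 0.2, 0.3   # only the lower triangle is filled in

upper = np.triu_indices(len(m), k=1)         # (row, column) pairs with row < column
m[upper] = m.T[upper]                        # m[i, j] = m[j, i] for every i < j

print(m)
```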

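There are a lot of pitfalls with pandas. E.g. access the entries of a Series by position via dfl.iloc[0] instead of dfl[0]. Both work on a Series with the default integer index, but the positional .iloc access is much faster. (On a DataFrame, df[0] selects the column labelled 0 rather than the first row, so there only df.iloc[0] returns a row.)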

For the first dataset (4 rows, each with a list of size 3), this gave a speedup of a bit more than 3 times (from 1.68 ms down to 443 µs):

1000 loops, best of 3: 443 µs per loop

And with a bigger dataset (100 rows with lists of size 15) I got far better results, with a speedup by a factor of more than 11:

# operating on the DataFrame
10 loop, best of 3: 565 ms per loop

# operating on the Series
10 loops, best of 3: 47.7 ms per loop

UPDATE 2

When not using pandas at all during the calculation, you get another significant speedup. To do so, simply convert the column you operate on into a list.

%%timeit df = pd.DataFrame([[['a','b','c']], [['d','b','c']], [['a','b','c']], [['a','b','c']]], columns = ["list_of_value"])
# %%timeit df = pd.DataFrame([[random.choices(list("abcdefghijklmnopqrstuvwxyz"), k = 15)] for _ in range(100)], columns = ["list_of_value"])

# convert the column of the DataFrame to a list
dfl = list(df.list_of_value)

# create a matrix filled with ones (thus the diagonal is already filled with ones)
df2 = np.ones(shape = (len(dfl), len(dfl)))

for i, d0 in enumerate(dfl):
    d0 = set(d0)
    d0_len = len(d0)
    # the inner loop starts at i+1 because we don't need to calculate the diagonal
    for j in range(i + 1, len(dfl)):
        df2[j, i] = len(d0.intersection(dfl[j])) / d0_len
# copy the lower triangular matrix to the upper triangular matrix
df2[np.mask_indices(len(df2), np.triu)] = df2.T[np.mask_indices(len(df2), np.triu)]
# create a DataFrame from the numpy array with the column names set to score<id>
df2 = pd.DataFrame(df2, columns = [f"score{i}" for i in range(len(dfl))])

On the data provided in the question we only see a slightly better result compared to the first update.

1000 loops, best of 3: 363 µs per loop

But when using bigger data (100 rows with lists of size 15) the advantage becomes obvious:

100 loops, best of 3: 5.26 ms per loop

Here is a comparison of all the suggested methods:

+----------+-----------------------------------------+
|          | Using the Dataset from the question     |
+----------+-----------------------------------------+
| Question | 100 loops, best of 3: 4.63 ms per loop  |
+----------+-----------------------------------------+
| Answer   | 1000 loops, best of 3: 1.59 ms per loop |
+----------+-----------------------------------------+
| Update 1 | 1000 loops, best of 3: 447 µs per loop  |
+----------+-----------------------------------------+
| Update 2 | 1000 loops, best of 3: 362 µs per loop  |
+----------+-----------------------------------------+
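For reuse, the loop from Update 2 could be wrapped in a function along the following lines. This is only a sketch and has not been timed against the versions above: the name `score_matrix` is hypothetical, every list is converted to a set once up front, and both triangles are filled in the same step instead of being mirrored afterwards. Like the code above, it assumes the score is symmetric, i.e. that all lists have the same number of distinct values.

```python
from itertools import combinations

import numpy as np


def score_matrix(lists):
    """Pairwise scores; each pair is computed once and written to both triangles."""
    sets = [set(values) for values in lists]     # build every set only once
    n = len(sets)
    result = np.ones((n, n))                     # the diagonal stays at 1
    for i, j in combinations(range(n), 2):       # every pair with i < j
        value = len(sets[i] & sets[j]) / len(sets[i])
        result[i, j] = result[j, i] = value
    return result


rows = [['a', 'b', 'c'], ['d', 'b', 'c'], ['a', 'b', 'c'], ['a', 'b', 'c']]
print(score_matrix(rows))
```

The result is a plain NumPy array; it can be wrapped in `pd.DataFrame(...)` afterwards in the same way as in the code above.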

Tags:

python

pandas

Sriram Arvind Lakshmanakumar
