Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to create a co-occurence matrix of product orders in python?

Let's assume we have the following dataframe that includes customer orders (order_id) and the products that the individual order contained (product_id):

import pandas as pd

df = pd.DataFrame({'order_id' : [1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3],
                   'product_id' : [365, 48750, 3333, 9877, 48750, 32001, 3333, 3333, 365, 11202, 365]})
print(df)

   order_id product_id
0         1        365
1         1      48750
2         1       3333
3         1       9877
4         2      48750
5         2      32001
6         2       3333
7         3       3333
8         3        365
9         3      11202
10        3        365

It would be interesting to know how often product pairs appeared together in the same basket.

How does one create a co-occurence matrix in python that looks like this:

       365  48750  3333  9877  32001  11202
365      1      1     2     1      0      1
48750    1      0     2     1      1      0
3333     2      2     0     1      1      1
9877     1      1     1     0      0      0
32001    0      1     1     0      0      0
11202    1      0     1     0      0      0

I would be very thankful for your help!

like image 930
Zwiebak Avatar asked Nov 14 '20 18:11

Zwiebak


1 Answers

We start by grouping the df by order_id, and within each group calculate all possible pairs. Note we sort first by product_id so the same pairs in different groups are always in the same order

import itertools
all_pairs = []
for _, group in df.sort_values('product_id').groupby('order_id'):
    all_pairs += list(itertools.combinations(group['product_id'],2))

all_pairs

we get a list of all pairs from all orders

[('3333', '365'),
 ('3333', '48750'),
 ('3333', '9877'),
 ('365', '48750'),
 ('365', '9877'),
 ('48750', '9877'),
 ('32001', '3333'),
 ('32001', '48750'),
 ('3333', '48750'),
 ('11202', '3333'),
 ('11202', '365'),
 ('11202', '365'),
 ('3333', '365'),
 ('3333', '365'),
 ('365', '365')]

Now we count duplicates

from collections import Counter

count_dict = dict(Counter(all_pairs))
count_dict

so we get the count of each pair, basically what you are after

{('3333', '365'): 3,
 ('3333', '48750'): 2,
 ('3333', '9877'): 1,
 ('365', '48750'): 1,
 ('365', '9877'): 1,
 ('48750', '9877'): 1,
 ('32001', '3333'): 1,
 ('32001', '48750'): 1,
 ('11202', '3333'): 1,
 ('11202', '365'): 2,
 ('365', '365'): 1}

Putting this back into a cross-product table is a bit of work, the key bit is spliitng the tuples into columns by calling .apply(pd.Series) and eventually moving one of the columns to the column names via unstack:

(pd.DataFrame.from_dict(count_dict, orient='index')
    .reset_index(0)
    .set_index(0)['index']
    .apply(pd.Series)
    .rename(columns = {0:'pid1',1:'pid2'})
    .reset_index()
    .rename(columns = {0:'count'})
    .set_index(['pid1', 'pid2'] )
    .unstack()
    .fillna(0))

this produces a 'compact' form of the table you are after that only includes products that appeared in at least one pair


count
pid2    3333 365    48750  9877
pid1                
11202   1.0  2.0    0.0    0.0
32001   1.0  0.0    1.0    0.0
3333    0.0  3.0    2.0    1.0
365     0.0  1.0    1.0    1.0
48750   0.0  0.0    0.0    1.0

UPDATE Here is a rather simplified version of the above, following various discussions in the comments

import numpy as np
import pandas as pd
from collections import Counter

# we start as in the original solution but use permutations not combinations
all_pairs = []
for _, group in df.sort_values('product_id').groupby('order_id'):
    all_pairs += list(itertools.permutations(group['product_id'],2))
count_dict = dict(Counter(all_pairs))

# We create permutations for _all_ product_ids ... note we use unique() but also product(..) to allow for (365,265) combinations
total_pairs = list(itertools.product(df['product_id'].unique(),repeat = 2))

# pull out first and second elements separately
pid1 = [p[0] for p in total_pairs]
pid2 = [p[1] for p in total_pairs]

# and get the count for those permutations that exist from count_dict. Use 0
# for those that do not
count = [count_dict.get(p,0) for p in total_pairs]

# Now a bit of dataFrame magic
df_cross = pd.DataFrame({'pid1':pid1, 'pid2':pid2, 'count':count})
df_cross.set_index(['pid1','pid2']).unstack()

and we are done. df_cross below


count
pid2    11202   32001   3333    365 48750   9877
pid1                        
11202   0       0       1       2   0       0
32001   0       0       1       0   1       0
3333    1       1       0       3   2       1
365     2       0       3       2   1       1
48750   0       1       2       1   0       1
9877    0       0       1       1   1       0
like image 62
piterbarg Avatar answered Nov 09 '22 23:11

piterbarg