Let's assume we have the following dataframe that includes customer orders (order_id) and the products that the individual order contained (product_id):
import pandas as pd
df = pd.DataFrame({'order_id' : [1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3],
'product_id' : [365, 48750, 3333, 9877, 48750, 32001, 3333, 3333, 365, 11202, 365]})
print(df)
order_id product_id
0 1 365
1 1 48750
2 1 3333
3 1 9877
4 2 48750
5 2 32001
6 2 3333
7 3 3333
8 3 365
9 3 11202
10 3 365
It would be interesting to know how often product pairs appeared together in the same basket.
How does one create a co-occurence matrix in python that looks like this:
365 48750 3333 9877 32001 11202
365 1 1 2 1 0 1
48750 1 0 2 1 1 0
3333 2 2 0 1 1 1
9877 1 1 1 0 0 0
32001 0 1 1 0 0 0
11202 1 0 1 0 0 0
I would be very thankful for your help!
We start by grouping the df by order_id, and within each group calculate all possible pairs. Note we sort first by product_id so the same pairs in different groups are always in the same order
import itertools
all_pairs = []
for _, group in df.sort_values('product_id').groupby('order_id'):
all_pairs += list(itertools.combinations(group['product_id'],2))
all_pairs
we get a list of all pairs from all orders
[('3333', '365'),
('3333', '48750'),
('3333', '9877'),
('365', '48750'),
('365', '9877'),
('48750', '9877'),
('32001', '3333'),
('32001', '48750'),
('3333', '48750'),
('11202', '3333'),
('11202', '365'),
('11202', '365'),
('3333', '365'),
('3333', '365'),
('365', '365')]
Now we count duplicates
from collections import Counter
count_dict = dict(Counter(all_pairs))
count_dict
so we get the count of each pair, basically what you are after
{('3333', '365'): 3,
('3333', '48750'): 2,
('3333', '9877'): 1,
('365', '48750'): 1,
('365', '9877'): 1,
('48750', '9877'): 1,
('32001', '3333'): 1,
('32001', '48750'): 1,
('11202', '3333'): 1,
('11202', '365'): 2,
('365', '365'): 1}
Putting this back into a cross-product table is a bit of work, the key bit is spliitng the tuples into columns by calling .apply(pd.Series)
and eventually moving one of the columns to the column names via unstack
:
(pd.DataFrame.from_dict(count_dict, orient='index')
.reset_index(0)
.set_index(0)['index']
.apply(pd.Series)
.rename(columns = {0:'pid1',1:'pid2'})
.reset_index()
.rename(columns = {0:'count'})
.set_index(['pid1', 'pid2'] )
.unstack()
.fillna(0))
this produces a 'compact' form of the table you are after that only includes products that appeared in at least one pair
count
pid2 3333 365 48750 9877
pid1
11202 1.0 2.0 0.0 0.0
32001 1.0 0.0 1.0 0.0
3333 0.0 3.0 2.0 1.0
365 0.0 1.0 1.0 1.0
48750 0.0 0.0 0.0 1.0
UPDATE Here is a rather simplified version of the above, following various discussions in the comments
import numpy as np
import pandas as pd
from collections import Counter
# we start as in the original solution but use permutations not combinations
all_pairs = []
for _, group in df.sort_values('product_id').groupby('order_id'):
all_pairs += list(itertools.permutations(group['product_id'],2))
count_dict = dict(Counter(all_pairs))
# We create permutations for _all_ product_ids ... note we use unique() but also product(..) to allow for (365,265) combinations
total_pairs = list(itertools.product(df['product_id'].unique(),repeat = 2))
# pull out first and second elements separately
pid1 = [p[0] for p in total_pairs]
pid2 = [p[1] for p in total_pairs]
# and get the count for those permutations that exist from count_dict. Use 0
# for those that do not
count = [count_dict.get(p,0) for p in total_pairs]
# Now a bit of dataFrame magic
df_cross = pd.DataFrame({'pid1':pid1, 'pid2':pid2, 'count':count})
df_cross.set_index(['pid1','pid2']).unstack()
and we are done. df_cross
below
count
pid2 11202 32001 3333 365 48750 9877
pid1
11202 0 0 1 2 0 0
32001 0 0 1 0 1 0
3333 1 1 0 3 2 1
365 2 0 3 2 1 1
48750 0 1 2 1 0 1
9877 0 0 1 1 1 0
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With