Let's say I have the following table/data frame:
import pandas as pd

d = {'store': ['s1', 's1', 's2', 's2'], 'product': ['a', 'c', 'a', 'c']}
df = pd.DataFrame(data=d)
print(df)
  store product
0    s1       a
1    s1       c
2    s2       a
3    s2       c
I would like to find, for each pair of products, the number of times they co-occur in a store.
Since the data is very large (5M rows, about 50K distinct products and 20K distinct stores) and there are many potential co-occurrence pairs, I would just like to get the top n (for example, 10) co-occurrences for each product along with the count of each co-occurrence. The example result is below:
product_1 product_2 cooccurrence_count
0 a c 2
1 c a 2
An effective and efficient solution in SQL instead of pandas would also be acceptable.
Try:
df.merge(df, on=['store']).query('product_x != product_y')\
  .groupby(['product_x','product_y'], as_index=False).count()\
  .rename(columns={'store':'cooccurrence_count'})
Output:
product_x product_y cooccurrence_count
0 a c 2
1 c a 2
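The question asks for only the top n co-occurrences per product; one possible follow-up on the result above (the chaining and the head(10) selection are my sketch, not part of the original answer):
out = (df.merge(df, on=['store'])
         .query('product_x != product_y')
         .groupby(['product_x', 'product_y'], as_index=False).count()
         .rename(columns={'store': 'cooccurrence_count'}))
# keep the 10 most frequent partners for each product_x
top10 = (out.sort_values('cooccurrence_count', ascending=False)
            .groupby('product_x')
            .head(10))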
With very large dataframes this self-merge might cause a memory problem.
Splitting the work up by store might help with memory usage:
from functools import reduce

l = {}
for n, g in df.groupby('store'):
    l[n] = g.merge(g, how='cross').query('product_x != product_y')\
            .groupby(['product_x', 'product_y']).count()
# sum the per-store counts; fill_value=0 keeps pairs that are missing from some stores
reduce(lambda x, y: x.add(y, fill_value=0), l.values())
Let's chop it up by 'store'
Try with pd.crosstab, then dot and value_counts:
s = pd.crosstab(df['store'],df['product'])
out = s.dot(s.columns+',').value_counts()
out
a,c, 2
Or we can do:
import numpy as np

s = pd.crosstab(df['store'], df['product'])
s = s.T.dot(s).astype(float)
s.values[np.triu_indices(len(s))] = np.nan
s.stack()
product product
c a 2.0
dtype: float64
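To turn that product-by-product matrix into the requested top-n list, a possible follow-up (the reshaping and top-10 selection below are my sketch, not part of the original answer):
s = pd.crosstab(df['store'], df['product'])
co = s.T.dot(s)                                  # product-by-product co-occurrence counts
pairs = co.stack().rename('cooccurrence_count')\
          .rename_axis(['product_1', 'product_2']).reset_index()
pairs = pairs[pairs['product_1'] != pairs['product_2']]  # drop self-pairs (the diagonal)
top10 = pairs.sort_values('cooccurrence_count', ascending=False)\
             .groupby('product_1').head(10)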
Only because the question is well written and it seemed like a nice puzzle, here's some magic.
Potentially you'll have to store a lot of data, so you need to compress the frame as much as possible and do several passes through the base. If the dataframe contains non-primitive objects, convert them into integers; if you use multiprocessing, the dataframe will be copied into the subprocesses, so keeping its contents small helps.
The runtime depends on the length of the dataframe, but also on the number of unique stores, the number of unique products and the size of a chunk of pairs to count. Spreading the work across many subprocesses can speed things up, but there is a constant cost to all the functions which will accumulate. For example, pandas' own methods run faster on a single ten-thousand-row dataframe than on a dozen thousand-row frames. And when you're running nested calls on sub-dataframes of unpredictable size, things get complicated. You'll probably have to experiment a bit to find a chunksize with an optimal speed/memory trade-off.
Test runtimes with smaller numbers first, including fewer stores and products. That being said, this is not a quick task; on a high-end machine it completes in about ten minutes.
import pandas as pd, numpy as np
df = pd.DataFrame({
    'store': np.random.randint(0, int(2e4), int(5e6)),
    'product': np.random.randint(0, int(5e4), int(5e6))
}).sort_values('store')
products = df['product'].unique()
N, chunksize, Ntop = len(products), int(1e4), 200
# smallest integer dtype that can hold every product code
dtype = np.min_scalar_type(max(products.max(), N))
df = df.astype(dtype)
def store_cats(df):
    # label-encode every column, keeping the original categories for later
    df = df.astype('category')
    cats = [df[x].cat.categories for x in df.columns]
    for col in df.columns:
        df[col] = df[col].cat.codes
    return df, cats

def restore_cats(summary, cats):
    # map the integer codes back to the original product labels
    # (cats[1] holds the categories of the 'product' column)
    for col in ['product_x', 'product_y']:
        summary[col] = pd.Categorical.from_codes(summary[col], cats[1])

def subsets(n=chunksize):
    n = int(n)
    res = [frozenset(products[i:i+n]) for i in range(0, N, n)]
    info = "In total there will be {:.1E} pairs, per pass {:.1E} will be checked, that's up to around {} mb per pass, {} passes"
    print(info.format((N**2), (n*N), (n*N*3*8/1e6), len(res)))
    return res

def count(df, subset):
    # count co-occurring pairs within one store, restricted to products in this chunk
    res = df.merge(df, on='store')\
            .query('(product_x < product_y) and product_x in @subset')\
            .groupby(['product_x','product_y'])\
            .count()\
            .astype(dtype)\
            .reset_index()
    return res

def one_pass(gr, subset):
    per_group = gr.apply(count, subset)
    total_counts = per_group.sort_values(['product_x','product_y'])\
                            .groupby(['product_x','product_y'])\
                            .agg('sum')\
                            .sort_values('store', ascending=False)[:Ntop]\
                            .copy().reset_index()
    return total_counts

def merge_passes(dfs):
    res = pd.concat(dfs, ignore_index=True)
    # add the mirrored pairs so every product shows up in product_x
    res = pd.concat([res, res.rename(columns={'product_x':'product_y','product_y':'product_x'})],
                    ignore_index=True)
    res = res.sort_values('store', ascending=False)[:Ntop]
    return res
from concurrent.futures import as_completed, ProcessPoolExecutor as Pool

gr = df.groupby('store', as_index=False)

def worker(subset):
    return one_pass(gr, subset)

def run_progress(max_workers=2, chunksize=chunksize):
    from tqdm.auto import tqdm
    with Pool(max_workers=max_workers) as p:
        futures = [p.submit(worker, subset) for subset in subsets(chunksize)]
        summaries = [x.result() for x in tqdm(as_completed(futures), total=len(futures))]
    return merge_passes(summaries)
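The code above defines the pipeline but never invokes it. A minimal driver might look like the following; the call itself, the worker count and the __main__ guard are my assumptions (fork-based multiprocessing, i.e. Linux, is assumed so the subprocesses can see gr):
if __name__ == '__main__':
    top_pairs = run_progress(max_workers=4, chunksize=int(1e4))
    # merge_passes leaves the pair counts in the 'store' column; rename it for readability
    top_pairs = top_pairs.rename(columns={'store': 'cooccurrence_count'})
    print(top_pairs.head(10))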
I honestly don't know how this will perform on a set that large, but here's a SQL option:
-- test data
CREATE TABLE #T (
store varchar(10),
product varchar(5)
)
INSERT INTO #T (store, product)
VALUES
('s1','a'),
('s1','c'),
('s2','a'),
('s2','c')
-- the part you really want:
SELECT TOP 10
prod1.product_1
, prod2.product_2
, COUNT(*) cooccurrence_count
FROM
(SELECT product product_1, store FROM #T) prod1
INNER JOIN
(SELECT product product_2, store FROM #T) prod2
ON prod1.store = prod2.store AND prod1.product_1 <> prod2.product_2
GROUP BY prod1.product_1, prod2.product_2
ORDER BY cooccurrence_count desc