Let's say I have the following table/data frame:
import pandas as pd

d = {'store': ['s1', 's1', 's2', 's2'], 'product': ['a', 'c', 'a', 'c']}
df = pd.DataFrame(data=d)
print(df)
  store product
0    s1       a
1    s1       c
2    s2       a
3    s2       c
I would like to find, for each pair of products, the number of times they co-occur in a store.
Since the data is very large (5M rows, about 50K distinct products and 20K distinct stores) and there are many potential co-occurrence pairs, I would just like to get the top n (for example, 10) co-occurrences for each product along with the count of each co-occurrence. The example result is below:
product_1 product_2 cooccurrence_count
0 a c 2
1 c a 2
An effective and efficient solution in SQL instead of pandas would also be acceptable.
Try:
df.merge(df, on=['store']).query('product_x != product_y')\
  .groupby(['product_x','product_y'], as_index=False).count()\
  .rename(columns={'store':'cooccurrence_count'})
Output:
product_x product_y cooccurrence_count
0 a c 2
1 c a 2
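The question asks for only the top n co-occurrences per product; one possible follow-up on the result above (the chaining and the head(10) selection are my sketch, not part of the original answer):
out = (df.merge(df, on=['store'])
         .query('product_x != product_y')
         .groupby(['product_x', 'product_y'], as_index=False).count()
         .rename(columns={'store': 'cooccurrence_count'}))
# keep the 10 most frequent partners for each product_x
top10 = (out.sort_values('cooccurrence_count', ascending=False)
            .groupby('product_x')
            .head(10))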
With very large dataframes this self-merge might cause a memory problem.
Splitting the work up by store might help with memory usage:
from functools import reduce

l = {}
for n, g in df.groupby('store'):
    l[n] = g.merge(g, how='cross').query('product_x != product_y')\
            .groupby(['product_x', 'product_y']).count()
# sum the per-store counts; fill_value=0 keeps pairs that are missing from some stores
reduce(lambda x, y: x.add(y, fill_value=0), l.values())
Let's chop it up by 'store'
Try with pd.crosstab, then dot and value_counts:
s = pd.crosstab(df['store'],df['product'])
out = s.dot(s.columns+',').value_counts()
out
a,c, 2
Or we can do:
import numpy as np

s = pd.crosstab(df['store'], df['product'])
s = s.T.dot(s).astype(float)
s.values[np.triu_indices(len(s))] = np.nan
s.stack()
product product
c a 2.0
dtype: float64
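To turn that product-by-product matrix into the requested top-n list, a possible follow-up (the reshaping and top-10 selection below are my sketch, not part of the original answer):
s = pd.crosstab(df['store'], df['product'])
co = s.T.dot(s)                                  # product-by-product co-occurrence counts
pairs = co.stack().rename('cooccurrence_count')\
          .rename_axis(['product_1', 'product_2']).reset_index()
pairs = pairs[pairs['product_1'] != pairs['product_2']]  # drop self-pairs (the diagonal)
top10 = pairs.sort_values('cooccurrence_count', ascending=False)\
             .groupby('product_1').head(10)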
Only because the question is well written and it seemed like a nice puzzle, here's some magic.
Potentially you'll have to store a lot of data, so you need to compress the frame as much as possible and do several passes through the base. If the dataframe contains non-primitive objects, convert them into integers; if you use multiprocessing, the dataframe will be copied into the subprocesses, so keeping its contents small helps.
The runtime depends on the length of the dataframe, but also on the number of unique stores, the number of unique products and the size of a chunk of pairs to count. Spreading the work across many subprocesses can speed things up, but there is a constant cost to all the functions which will accumulate. For example, pandas' own methods run faster on a single ten-thousand-row dataframe than on a dozen thousand-row frames. And when you're running nested calls on sub-dataframes of unpredictable size, things get complicated. You'll probably have to experiment a bit to find a chunksize with an optimal speed/memory trade-off.
Test runtimes with smaller numbers first, including fewer stores and products. That being said, this is not a quick task; on a high-end machine it completes in about ten minutes.
import pandas as pd, numpy as np
df = pd.DataFrame({
    'store': np.random.randint(0, int(2e4), int(5e6)),
    'product': np.random.randint(0, int(5e4), int(5e6))
}).sort_values('store')
products = df['product'].unique()
N, chunksize, Ntop = len(products), int(1e4), 200
# smallest integer dtype that can hold every product code
dtype = np.min_scalar_type(max(products.max(), N))
df = df.astype(dtype)
def store_cats(df):
    # label-encode every column, keeping the original categories for later
    df = df.astype('category')
    cats = [df[x].cat.categories for x in df.columns]
    for col in df.columns:
        df[col] = df[col].cat.codes
    return df, cats

def restore_cats(summary, cats):
    # map the integer codes back to the original product labels
    # (cats[1] holds the categories of the 'product' column)
    for col in ['product_x', 'product_y']:
        summary[col] = pd.Categorical.from_codes(summary[col], cats[1])

def subsets(n=chunksize):
    n = int(n)
    res = [frozenset(products[i:i+n]) for i in range(0, N, n)]
    info = "In total there will be {:.1E} pairs, per pass {:.1E} will be checked, that's up to around {} mb per pass, {} passes"
    print(info.format((N**2), (n*N), (n*N*3*8/1e6), len(res)))
    return res

def count(df, subset):
    # count co-occurring pairs within one store, restricted to products in this chunk
    res = df.merge(df, on='store')\
            .query('(product_x < product_y) and product_x in @subset')\
            .groupby(['product_x','product_y'])\
            .count()\
            .astype(dtype)\
            .reset_index()
    return res

def one_pass(gr, subset):
    per_group = gr.apply(count, subset)
    total_counts = per_group.sort_values(['product_x','product_y'])\
                            .groupby(['product_x','product_y'])\
                            .agg('sum')\
                            .sort_values('store', ascending=False)[:Ntop]\
                            .copy().reset_index()
    return total_counts

def merge_passes(dfs):
    res = pd.concat(dfs, ignore_index=True)
    # add the mirrored pairs so every product shows up in product_x
    res = pd.concat([res, res.rename(columns={'product_x':'product_y','product_y':'product_x'})],
                    ignore_index=True)
    res = res.sort_values('store', ascending=False)[:Ntop]
    return res
from concurrent.futures import as_completed, ProcessPoolExecutor as Pool

gr = df.groupby('store', as_index=False)

def worker(subset):
    return one_pass(gr, subset)

def run_progress(max_workers=2, chunksize=chunksize):
    from tqdm.auto import tqdm
    with Pool(max_workers=max_workers) as p:
        futures = [p.submit(worker, subset) for subset in subsets(chunksize)]
        summaries = [x.result() for x in tqdm(as_completed(futures), total=len(futures))]
    return merge_passes(summaries)
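The code above defines the pipeline but never invokes it. A minimal driver might look like the following; the call itself, the worker count and the __main__ guard are my assumptions (fork-based multiprocessing, i.e. Linux, is assumed so the subprocesses can see gr):
if __name__ == '__main__':
    top_pairs = run_progress(max_workers=4, chunksize=int(1e4))
    # merge_passes leaves the pair counts in the 'store' column; rename it for readability
    top_pairs = top_pairs.rename(columns={'store': 'cooccurrence_count'})
    print(top_pairs.head(10))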
I honestly don't know how this will perform on a set that large, but here's a SQL option:
-- test data
CREATE TABLE #T (
store varchar(10),
product varchar(5)
)
INSERT INTO #T (store, product)
VALUES
('s1','a'),
('s1','c'),
('s2','a'),
('s2','c')
-- the part you really want:
SELECT TOP 10
prod1.product_1
, prod2.product_2
, COUNT(*) cooccurrence_count
FROM
(SELECT product product_1, store FROM #T) prod1
INNER JOIN
(SELECT product product_2, store FROM #T) prod2
ON prod1.store = prod2.store AND prod1.product_1 <> prod2.product_2
GROUP BY prod1.product_1, prod2.product_2
ORDER BY cooccurrence_count desc