 

Pandas/SQL co-occurrence count

Let's say I have the following table/data frame:

import pandas as pd

d = {'store': ['s1', 's1', 's2', 's2'], 'product': ['a', 'c', 'a', 'c']}
df = pd.DataFrame(data=d)


print(df)
  store product
0    s1       a
1    s1       c
2    s2       a
3    s2       c

I would like to find, for each pair of products, the number of times they co-occur in a store.

Since the data is very large (5M rows, with about 50K distinct products and 20K distinct stores) and there are many potential co-occurrence pairs, I would just like to get the top n (for example, 10) co-occurrences for each product, along with the co-occurrence count. An example result is below:

  product_1 product_2  cooccurrence_count
0         a         c                   2
1         c         a                   2

An effective and efficient solution in SQL instead of pandas would also be acceptable.

asked Jun 21 '21 by Mustard Tiger



4 Answers

Try:

df.merge(df, on=['store']).query('product_x != product_y')\
  .groupby(['product_x','product_y'], as_index=False).count()\
  .rename(columns={'store':'cooccurrence_count'})

Output:

  product_x product_y  cooccurrence_count
0         a         c                   2
1         c         a                   2

With very large dataframes this might cause a memory problem.
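
That concern is easy to quantify: the self-merge on 'store' materializes sum(n_s**2) rows, where n_s is the number of rows for store s. A back-of-envelope check (a sketch, using the question's numbers):

# rows the self-merge produces before any filtering
merged_rows = (df.groupby('store').size() ** 2).sum()
print(f'self-merge yields ~{merged_rows:,} rows')
# with 5M rows over 20K stores (~250 rows per store), that is about
# 20_000 * 250**2 = 1.25 billion rows, hence the memory concern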


Maybe this might help with memory usage:

from functools import reduce

l = {}
for n, g in df.groupby('store'):
    # cross-join each store's products with themselves, drop self-pairs
    l[n] = g.merge(g, how='cross').query('product_x != product_y')\
            .groupby(['product_x', 'product_y']).count()

# plain + would leave NaN for pairs missing from some store,
# so add with fill_value=0 instead
reduce(lambda x, y: x.add(y, fill_value=0), l.values())

This chops the work up by 'store', so only one store's pair combinations are in memory at a time.
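
To finish in the question's requested format, a small cleanup of the reduced result works (a sketch; the store_x/store_y column names come from the cross-merge suffixes):

totals = reduce(lambda x, y: x.add(y, fill_value=0), l.values())
result = totals.reset_index().rename(columns={'store_x': 'cooccurrence_count'})
result = result[['product_x', 'product_y', 'cooccurrence_count']]
result = result.astype({'cooccurrence_count': int})
print(result)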

answered Oct 16 '22 by Scott Boston

Try with pd.crosstab, then dot and value_counts:

s = pd.crosstab(df['store'], df['product'])
# integer counts times string labels: each store's row concatenates
# the products it contains into a key like 'a,c,'
out = s.dot(s.columns + ',').value_counts()
out
a,c,    2

Or we do

import numpy as np

s = pd.crosstab(df['store'], df['product'])
s = s.T.dot(s).astype(float)
# blank out the diagonal and upper triangle to keep each pair once
s.values[np.triu_indices(len(s))] = np.nan
s.stack()
product  product
c        a          2.0
dtype: float64
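
To reshape the stacked result into the question's requested columns and take the top n per product, a little post-processing works (a sketch; product_1/product_2 are the names the question asked for):

pairs = s.stack()  # the NaN'd diagonal and upper triangle are dropped
out = pd.DataFrame({
    'product_1': pairs.index.get_level_values(0),
    'product_2': pairs.index.get_level_values(1),
    'cooccurrence_count': pairs.to_numpy().astype(int),
})
# mirror the pairs so every product appears in product_1, then top 10 each
both = pd.concat([out, out.rename(columns={'product_1': 'product_2',
                                           'product_2': 'product_1'})])
top_n = both.sort_values('cooccurrence_count', ascending=False)\
            .groupby('product_1').head(10)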
answered Oct 16 '22 by BENY


Only because the question is well written and it seemed like a nice puzzle, here's some magic.

Potentially you'll have to store a lot of data, so you need to compress the frame as much as possible and do several passes over it. If the dataframe contains non-primitive objects, convert them to integers; if you use multiprocessing, the dataframe is copied into each subprocess, so keeping its contents small helps.

The runtime depends on the length of the dataframe, but also on the number of unique stores, the number of unique products, and the size of a chunk of pairs to count. Spreading the work across many subprocesses can speed things up, but every function carries a constant cost that accumulates. For example, pandas' own methods run faster on a single ten-thousand-row dataframe than on a dozen thousand-row frames, and when you run nested calls on sub-dataframes of unpredictable size, things get complicated. You'll probably have to experiment a bit to find a chunksize with an optimal speed/memory trade-off.

Test runtimes with smaller numbers first, including fewer stores and products. That being said, this is not a quick task; on a high-end machine it completes in about ten minutes.

import pandas as pd, numpy as np
df = pd.DataFrame({
  'store':np.random.randint(0,int(2e4),int(5e6)),
  'product':np.random.randint(0,int(5e4),int(5e6))
  }).sort_values('store')

products = df['product'].unique()
N, chunksize, Ntop = len(products), int(1e4), 200
dtype = np.min_scalar_type(max(products.max(),N))
df = df.astype(dtype)

def store_cats(df):
    df = df.astype('category')
    cats = [df[x].cat.categories for x in df.columns]
    for col in df.columns:
        df[col] = df[col].cat.codes
    return df, cats
def restore_cats(summary, cats):
    for col in ['product_x', 'product_y']:
        # cats[1] holds the product categories ('store' is cats[0])
        summary[col] = pd.Categorical.from_codes(summary[col], cats[1])

def subsets(n = chunksize):
    n = int(n)
    res = [frozenset(products[i:i+n]) for i in range(0,N,n)]
    info = "In total there will be {:.1E} pairs, per pass {:.1E} will be checked, that's up to around {} MB per pass, {} passes"
    print(info.format((N**2),(n*N),(n*N*3*8/1e6),len(res)))
    return res

def count(df,subset):
    res = df.merge(df,on = 'store')\
        .query('(product_x < product_y) and product_x in @subset')\
        .groupby(['product_x','product_y'])\
        .count()\
        .astype(dtype)\
        .reset_index()
    return res 
def one_pass(gr,subset):
    per_group = gr.apply(count,subset)
    total_counts = per_group.sort_values(['product_x','product_y'])\
        .groupby(['product_x','product_y'])\
        .agg('sum')\
        .sort_values('store',ascending=False)[:Ntop]\
        .copy().reset_index()
    return total_counts
def merge_passes(dfs):
    res = pd.concat(dfs, ignore_index=True)
    # add the mirrored pairs; DataFrame.append was removed in pandas 2.x
    mirrored = res.rename(columns={'product_x': 'product_y', 'product_y': 'product_x'})
    res = pd.concat([res, mirrored], ignore_index=True)
    res = res.sort_values('store', ascending=False)[:Ntop]
    return res

from concurrent.futures import as_completed, ProcessPoolExecutor as Pool

gr = df.groupby('store',as_index = False)
def worker(subset):
    return one_pass(gr,subset)
def run_progress(max_workers=2,chunksize=chunksize):
    from tqdm.auto import tqdm 
    with Pool(max_workers = max_workers) as p:
        futures = [p.submit(worker,subset) for subset in subsets(chunksize)]
        summaries = [x.result() for x in tqdm(as_completed(futures),total=len(futures))]
        return merge_passes(summaries)
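
A minimal driver to tie it together (a sketch, assuming the definitions above; the __main__ guard matters when ProcessPoolExecutor spawns rather than forks):

if __name__ == '__main__':
    top = run_progress(max_workers=2, chunksize=chunksize)
    # the 'store' column holds the counts after the groupby-count above
    top = top.rename(columns={'store': 'cooccurrence_count'})
    print(top.head(10))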
answered Oct 16 '22 by Dimitry


I honestly don't know how this will perform on a set that large, but here's a SQL option (T-SQL / SQL Server syntax):

-- test data
CREATE TABLE #T (
    store varchar(10), 
    product varchar(5)
    )

INSERT INTO #T (store, product)
VALUES
('s1','a'),
('s1','c'),
('s2','a'),
('s2','c')


-- the part you really want:
SELECT TOP 10 
      prod1.product_1
    , prod2.product_2
    , COUNT(*) cooccurrence_count
FROM 
    (SELECT product product_1, store from #t) prod1
    INNER JOIN 
    (SELECT product product_2, store from #t) prod2
    ON prod1.store = prod2.store AND prod1.product_1 <> prod2.product_2
GROUP BY prod1.product_1, prod2.product_2
ORDER BY cooccurrence_count desc

answered Oct 16 '22 by dww142