Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Finding weekly combinations of items bought together using pandas groupby

I have a df:

date            category            subcategory         order_id        product_id       branch
2021-05-04      A                   aa                  10              5                web
2021-06-04      A                   dd                  10              2                web
2021-05-06      B                   aa                  18              3                shop
2021-07-06      A                   aa                  50              10               web
2021-07-06      C                   cc                  10              15               web
2021-07-05      A                   ff                  101             30               shop
2021-10-04      D                   aa                  100             15               shop  

I am trying to answer a question which items categories and subcategories are bought together per branch type weekly. I am thinking of grouping the order_ids and aggregating the category & subcategory to a list like so:

a = (df.set_index('date')
     .groupby(['order_id','branch'])
     .resample('W-MON', label = 'left')
     .agg({'category':list, 'subcategory':list}))

Which returns :

                                                category            subcategory
order_id        branch          date            [A, A, A]           [aa, dd, aa]
10              web             2021-05-04      ...                 ...
18              shop            ...
50              web
100             web
101             shop

I am trying to build a structure which would show the frequency of each variation of the categories and subcategories bought each week per branch, something similar to this:

branch                                  date                        
                                        2021-05-04                      2021-05-011
                                                                        ...
web                  category            3, [A, A, A] 
                                         2, [A, A] 
                                         2, [A, A, B, B]

                     subcategory         5, [aa, dd, aa]
                                         4, [dd, aa]
                                         1, [dd]

shop                 category            3, [A, A, A] 
                                         2, [A, A] 
                                         2, [A, A, B, B]

                     subcategory         5, [aa, dd, aa]
                                         4, [dd, aa]
                                         1, [dd]

Where the number before the list denotes the number of times a certain combinations of categories and subcategories were bought in the same order. I am unsure how to achieve such a structure or a similar one that would show the weekly combination frequencies by branch. The order of the product_id in the order does not matter as the final basket is the same.

So the goal is to see the frequency of categories, subcategories & product_ids bought in the same order weekly. So if 2 different orders have the same products, the aggregated result would show 2, [A,B] [aa, bb] [5, 2] where the lists hold category, subcategory & product_id combinations.

like image 825
Jonas Palačionis Avatar asked Nov 05 '21 13:11

Jonas Palačionis


People also ask

How to groupby two columns in pandas Dataframe?

In this article, we will GroupBy two columns and count the occurrences of each combination in Pandas. DataFrame.groupby () method is used to separate the DataFrame into groups. It will generate the number of similar data counts present in a particular column of the data frame. by: mapping, function, string, label, or iterable to group elements.

How does groupby work in pandas?

Similar to the SQL GROUP BY statement, the Pandas method works by splitting our data, aggregating it in a given way (or ways), and re-combining the data in a meaningful way. Because the .groupby () method works by first splitting the data, we can actually work with the groups directly.

How do I get the Count of a pandas group?

Pandas groupby () and using agg (‘count’) Alternatively, you can also get the group count by using agg () or aggregate () function and passing the aggregate count function as a param. reset_index () function is used to set the index on DataFrame. By using this approach you can compute multiple aggregations. Yields below output. 7.

What is split-apply-combine in pandas?

One term that’s frequently used alongside .groupby () is split-apply-combine. This refers to a chain of three steps: It can be difficult to inspect df.groupby ("state") because it does virtually none of these things until you do something with the resulting object. Again, a Pandas GroupBy object is lazy.


1 Answers

This is what you need:

import pandas as pd
import numpy as np
from datetime import timedelta
from datetime import datetime as dt

# df=pd.read_excel('demo.xlsx')

df['date']=pd.to_datetime(df['date'])
df['date']=df['date'].dt.strftime('%Y-%m-%d')
df['date']=pd.to_datetime(df['date'])
df['year_week'] = df['date'].dt.strftime('%Y_%U')

df['orderid_year_week']=df['order_id'].astype(str)+'_'+df['year_week']

df=df.sort_values(['category', 'subcategory','product_id'], ascending=[True, True,True])

a = (df.set_index('orderid_year_week')
     .groupby(['year_week','order_id'],sort=False)
     .agg({'category':list, 'subcategory':list,'product_id':list})).reset_index()

a['category'] =a['category'].astype(str)
a['subcategory'] =a['subcategory'].astype(str)
a['product_id'] =a['product_id'].astype(str)

df=pd.pivot_table(a,index=['year_week','category','subcategory','product_id'],values='product_id',aggfunc='count').reset_index()

df.rename({'order_id':'count'},axis=1,inplace=True)

The output looks like this (I have added a few more entries on top of the sample that you provided):

enter image description here

Some things in your explanations are not crystal clear. But let me know if this fully answers your question.

like image 58
Yashar Ahmadov Avatar answered Oct 11 '22 08:10

Yashar Ahmadov