
Creating a pandas pivot table to count number of times items appear in a list together

Tags: python, pandas, numpy, pivot-table

I am trying to count the number of times users look at pages in the same session.

I am starting with a data frame listing user_ids and the page slugs they have visited:

user_id page_view_page_slug
1       slug1
1       slug2
1       slug3
1       slug4
2       slug5
2       slug3
2       slug2
2       slug1
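
For anyone who wants to reproduce this, the sample above can be built roughly as follows. This is only a sketch of the eight rows shown; the variable name `df` is chosen to match the code further down.

```python
import pandas as pd

# The eight sample rows listed above: one row per page view
df = pd.DataFrame({
    "user_id": [1, 1, 1, 1, 2, 2, 2, 2],
    "page_view_page_slug": ["slug1", "slug2", "slug3", "slug4",
                            "slug5", "slug3", "slug2", "slug1"],
})
print(df)
```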

What I want is a pivot table that counts the user_ids for each pair of slugs:

.	slug1	slug2	slug3	slug4	slug5
slug1	2	2	2	1	1
slug2	2	2	2	1	1
slug3	2	2	2	1	1
slug4	1	1	1	1	0
slug5	1	1	1	0	1

I realize the table is symmetric (slug1 vs. slug2 holds the same count as slug2 vs. slug1), but I can't think of a better way. So far I have aggregated the slugs into one list per user (a listagg):

def listagg(df, grouping_idx):
    return df.groupby(grouping_idx).agg(list)
new_df = listagg(df,'user_id')

Returning:

          page_view_page_slug
user_id                                                   
1        [slug1, slug2, slug3, slug4]
2        [slug5, slug3, slug2, slug2]
7        [slug6, slug4, slug7]
9        [slug3, slug5, slug1]

But I am struggling to think of a loop that counts how often items appear in a list together (regardless of order) and how to store the result. I also do not know how to get this into a pivot-table format.
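
To make the loop idea concrete, here is a rough sketch of one way it could look, not a definitive solution. It assumes a frame shaped like the listagg output, rebuilt here as a hypothetical `new_df` for the two sample users only. It counts each user at most once per slug pair and stores the counts in a `Counter` keyed by `(row_slug, col_slug)`.

```python
from collections import Counter
from itertools import product

import pandas as pd

# Hypothetical stand-in for the listagg output (sample users 1 and 2 only)
new_df = pd.DataFrame(
    {"page_view_page_slug": [["slug1", "slug2", "slug3", "slug4"],
                             ["slug5", "slug3", "slug2", "slug1"]]},
    index=pd.Index([1, 2], name="user_id"),
)

pair_counts = Counter()
for slugs in new_df["page_view_page_slug"]:
    unique = set(slugs)  # ignore order and repeat views by the same user
    # All ordered pairs, including (s, s) so the diagonal is filled too
    pair_counts.update(product(unique, repeat=2))

# Copy the (row_slug, col_slug) -> count mapping into a square table
all_slugs = sorted({s for slugs in new_df["page_view_page_slug"] for s in slugs})
table = pd.DataFrame(0, index=all_slugs, columns=all_slugs)
for (row, col), n in pair_counts.items():
    table.loc[row, col] = n
print(table)
```

A plain Python loop like this is easy to follow but will be slow for many users or long lists.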

asked Feb 03 '21 22:02

young_matt

1 Answer

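Another approach uses NumPy broadcasting. Comparing every value in user_id with every other value gives a boolean matrix whose entry (r, c) is True when rows r and c of df belong to the same user. Wrap that matrix in a DataFrame with both index and columns set to page_view_page_slug, then sum over identical labels along the rows and along the columns to count the user_ids in each cross section of slugs. The sum(level=0) shorthand for this was removed in pandas 2.0, so the code below groups on level 0 explicitly:

import pandas as pd

a = df['user_id'].values             # user id of each page view
i = list(df['page_view_page_slug'])  # slug of each page view

# a[:, None] == a is True where two page views share a user
m = pd.DataFrame(a[:, None] == a, index=i, columns=i)
# Collapse duplicate slug labels on the rows, then on the columns
m.groupby(level=0).sum().T.groupby(level=0).sum().T.astype(int)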

       slug1  slug2  slug3  slug4  slug5
slug1      2      2      2      1      1
slug2      2      2      2      1      1
slug3      2      2      2      1      1
slug4      1      1      1      1      0
slug5      1      1      1      0      1
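
One caveat: the broadcasting approach counts pairs of page views, so a user who views the same slug more than once (as user 2 does in the listagg output in the question) contributes more than once to a cell. If the goal is to count distinct users, a possible alternative is an indicator matrix multiplied by its own transpose. This is a sketch that assumes the sample data from the question, rebuilt here as `df`; the names `visited` and `co_counts` are only illustrative.

```python
import pandas as pd

# Sample data from the question: one row per page view
df = pd.DataFrame({
    "user_id": [1, 1, 1, 1, 2, 2, 2, 2],
    "page_view_page_slug": ["slug1", "slug2", "slug3", "slug4",
                            "slug5", "slug3", "slug2", "slug1"],
})

# users x slugs matrix: 1 if the user viewed the slug at least once, else 0
visited = pd.crosstab(df["user_id"], df["page_view_page_slug"]).clip(upper=1)

# (slugs x users) @ (users x slugs): entry (s, t) is the number of
# distinct users who viewed both s and t
co_counts = visited.T @ visited
co_counts.index.name = None
co_counts.columns.name = None
print(co_counts)
```

On the sample data this gives the same table as above. The two approaches differ only when a user repeats a slug.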

answered Sep 25 '22 00:09

Shubham Sharma
