Python: how to find triplets of items visited by triplets of users

Question

I have a CSV file that contains list of items visited by users, eg:

user_id item_id
370 293
471 380
280 100
280 118
219 118
...

The list is long - 30M rows.

I need to find triplets of items that were visited by three users (ie. all three users visited all three items). Such triplets are rare. Example result that i'm trying to find:

user_id item_id
1  15
1  26
1  31
77 15
77 26
77 31
45 15
45 26
45 31

What is a good way to do this? I can use Pandas or any other library.

jezrael · Accepted Answer

You can use transform with size and then filter by boolean indexing:

print (df)
    user_id  item_id
0         1       15
1         1       26
2         1       31
3        77       15
4        77       26
5        77       31
6        45       15
7        45       26
8        45       31
9       370      293
10      471      380
11      280      100
12      280      118
13      219      118

print (df.groupby('user_id')['item_id'].transform('size'))
0     3
1     3
2     3
3     3
4     3
5     3
6     3
7     3
8     3
9     1
10    1
11    2
12    2
13    1
Name: item_id, dtype: int64

print (df[df.groupby('user_id')['item_id'].transform('size') == 3])
   user_id  item_id
0        1       15
1        1       26
2        1       31
3       77       15
4       77       26
5       77       31
6       45       15
7       45       26
8       45       31

Solution with filtration is slowier:

df = df.groupby('user_id').filter(lambda x: len(x.item_id) == 3)
print (df)
   user_id  item_id
0        1       15
1        1       26
2        1       31
3       77       15
4       77       26
5       77       31
6       45       15
7       45       26
8       45       31

Mykola Zotko · Answer

You can solve your problem with the help of graph theory. In your case you are looking for complete bipartite graphs of type K_3,3 (the middle one in the picture below).

Bipartite graphs.

You can use this answer to find all maximal complete bipartite graphs with the help of maximal cliques and then filter out K_3,3 graphs.

Example:

I used networkx package and jupyter notebook.

import networkx as nx
from networkx.algorithms import bipartite
from itertools import combinations, chain

Let’s create a random bipartite graph (like your CSV data).

B = bipartite.random_graph(n=8, m=8, p=0.5, seed=3)
print(B.edges)

Prints:

[(0, 8), (0, 10), (0, 11), (0, 13), (0, 15), (1, 8), (1, 9), (1, 12), (1, 13), (1, 14), (2, 14), (2, 15), (3, 10), (3, 11), (3, 13), (3, 14), (4, 8), (4, 11), (4, 13), (4, 15), (5, 9), (5, 10), (5, 13), (5, 15), (6, 8), (6, 9), (6, 12), (6, 13), (6, 15), (7, 11), (7, 13)]

I used this answer to create a function to draw nice bipartite graphs.

def draw_bipart(graph, set_X, set_Y):
    pos = dict()
    pos.update( (n, (1, i)) for i, n in enumerate(set_X) ) # put nodes from X at x=1
    pos.update( (n, (2, i)) for i, n in enumerate(set_Y) ) # put nodes from Y at x=2
    nx.draw(graph, pos=pos, with_labels=True)

X, Y = bipartite.sets(B) #two sets of data (in your case user_id and item_id)

%matplotlib inline
draw_bipart(B, X, Y)

Bipartite graph B

Let’s create a copy of the graph B and connect all vertices in each set (X and Y) so we can search for cliques.

connect_B = B.copy()
edges_X = combinations(X, 2)
edges_Y = combinations(Y, 2)
connect_B.add_edges_from(chain(edges_X, edges_Y))
draw_bipart(connect_B, X, Y)

Now each pair of vertices from the set X (0, 1, 2, 3, 4, 5, 6, 7, 8, 9) is connected by an edge (not all visible due to the overlapping). The same goes for the Y set. Biparite graph connect_B

Now let’s search for maximal cliques:

cliques = list(nx.find_cliques(connect_B))
print(cliques)

Prints all maximal cliques:

[[2, 0, 4, 5, 6, 1, 3, 7], [2, 0, 4, 5, 6, 15], [2, 14, 1, 3], [2, 14, 15], [13, 0, 10, 11, 8, 15], [13, 0, 10, 11, 3], [13, 0, 10, 5, 3], [13, 0, 10, 5, 15], [13, 0, 4, 11, 8, 15], [13, 0, 4, 11, 3, 7], [13, 0, 4, 6, 1, 8], [13, 0, 4, 6, 1, 3, 5, 7], [13, 0, 4, 6, 15, 8], [13, 0, 4, 6, 15, 5], [13, 9, 8, 12, 6, 1], [13, 9, 8, 12, 6, 15], [13, 9, 8, 12, 14, 1], [13, 9, 8, 12, 14, 10, 11, 15], [13, 9, 5, 10, 15], [13, 9, 5, 6, 1], [13, 9, 5, 6, 15], [13, 14, 3, 1], [13, 14, 3, 10, 11]]

Now we have to filter K_3,3 graphs. I use here two conditions: the K_3,3 graph should have 6 vertices and 3 of them should belong to one set.

cliques_K33 = [c for c in cliques if len(c) == 6 and len(X.intersection(c)) == 3]
print(cliques_K33)

Prints:

[[13, 0, 4, 6, 15, 8]]

Finally we draw a subgraph of the graph B induced by vertices from the found K_3,3 clique:

draw_bipart(B.subgraph(cliques_K33[0]), X, Y)

Complete bipartite graph K<sub>3,3</sub>

Python: how to find triplets of items visited by triplets of users

Tags:

JustAC0der

2 Answers

jezrael

Example:

Mykola Zotko

Recent Activity

Donate For Us

Python: how to find triplets of items visited by triplets of users

Tags:

JustAC0der

2 Answers

jezrael

Example:

Mykola Zotko

Related questions

Recent Activity

Donate For Us