Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Python: how to find triplets of items visited by triplets of users

Tags:

I have a CSV file that contains list of items visited by users, eg:

user_id item_id
370 293
471 380
280 100
280 118
219 118
...

The list is long - 30M rows.

I need to find triplets of items that were visited by three users (ie. all three users visited all three items). Such triplets are rare. Example result that i'm trying to find:

user_id item_id
1  15
1  26
1  31
77 15
77 26
77 31
45 15
45 26
45 31

What is a good way to do this? I can use Pandas or any other library.

like image 224
JustAC0der Avatar asked Mar 08 '17 08:03

JustAC0der


2 Answers

You can use transform with size and then filter by boolean indexing:

print (df)
    user_id  item_id
0         1       15
1         1       26
2         1       31
3        77       15
4        77       26
5        77       31
6        45       15
7        45       26
8        45       31
9       370      293
10      471      380
11      280      100
12      280      118
13      219      118
print (df.groupby('user_id')['item_id'].transform('size'))
0     3
1     3
2     3
3     3
4     3
5     3
6     3
7     3
8     3
9     1
10    1
11    2
12    2
13    1
Name: item_id, dtype: int64

print (df[df.groupby('user_id')['item_id'].transform('size') == 3])
   user_id  item_id
0        1       15
1        1       26
2        1       31
3       77       15
4       77       26
5       77       31
6       45       15
7       45       26
8       45       31

Solution with filtration is slowier:

df = df.groupby('user_id').filter(lambda x: len(x.item_id) == 3)
print (df)
   user_id  item_id
0        1       15
1        1       26
2        1       31
3       77       15
4       77       26
5       77       31
6       45       15
7       45       26
8       45       31
like image 161
jezrael Avatar answered Oct 10 '22 17:10

jezrael


You can solve your problem with the help of graph theory. In your case you are looking for complete bipartite graphs of type K3,3 (the middle one in the picture below).

Bipartite graphs.

You can use this answer to find all maximal complete bipartite graphs with the help of maximal cliques and then filter out K3,3 graphs.

Example:

I used networkx package and jupyter notebook.

import networkx as nx
from networkx.algorithms import bipartite
from itertools import combinations, chain

Let’s create a random bipartite graph (like your CSV data).

B = bipartite.random_graph(n=8, m=8, p=0.5, seed=3)
print(B.edges)

Prints:

[(0, 8), (0, 10), (0, 11), (0, 13), (0, 15), (1, 8), (1, 9), (1, 12), (1, 13), (1, 14), (2, 14), (2, 15), (3, 10), (3, 11), (3, 13), (3, 14), (4, 8), (4, 11), (4, 13), (4, 15), (5, 9), (5, 10), (5, 13), (5, 15), (6, 8), (6, 9), (6, 12), (6, 13), (6, 15), (7, 11), (7, 13)]

I used this answer to create a function to draw nice bipartite graphs.

def draw_bipart(graph, set_X, set_Y):
    pos = dict()
    pos.update( (n, (1, i)) for i, n in enumerate(set_X) ) # put nodes from X at x=1
    pos.update( (n, (2, i)) for i, n in enumerate(set_Y) ) # put nodes from Y at x=2
    nx.draw(graph, pos=pos, with_labels=True)

X, Y = bipartite.sets(B) #two sets of data (in your case user_id and item_id)

%matplotlib inline
draw_bipart(B, X, Y) 

Bipartite graph B

Let’s create a copy of the graph B and connect all vertices in each set (X and Y) so we can search for cliques.

connect_B = B.copy()
edges_X = combinations(X, 2)
edges_Y = combinations(Y, 2)
connect_B.add_edges_from(chain(edges_X, edges_Y))
draw_bipart(connect_B, X, Y)

Now each pair of vertices from the set X (0, 1, 2, 3, 4, 5, 6, 7, 8, 9) is connected by an edge (not all visible due to the overlapping). The same goes for the Y set. Biparite graph connect_B

Now let’s search for maximal cliques:

cliques = list(nx.find_cliques(connect_B))
print(cliques)

Prints all maximal cliques:

[[2, 0, 4, 5, 6, 1, 3, 7], [2, 0, 4, 5, 6, 15], [2, 14, 1, 3], [2, 14, 15], [13, 0, 10, 11, 8, 15], [13, 0, 10, 11, 3], [13, 0, 10, 5, 3], [13, 0, 10, 5, 15], [13, 0, 4, 11, 8, 15], [13, 0, 4, 11, 3, 7], [13, 0, 4, 6, 1, 8], [13, 0, 4, 6, 1, 3, 5, 7], [13, 0, 4, 6, 15, 8], [13, 0, 4, 6, 15, 5], [13, 9, 8, 12, 6, 1], [13, 9, 8, 12, 6, 15], [13, 9, 8, 12, 14, 1], [13, 9, 8, 12, 14, 10, 11, 15], [13, 9, 5, 10, 15], [13, 9, 5, 6, 1], [13, 9, 5, 6, 15], [13, 14, 3, 1], [13, 14, 3, 10, 11]]

Now we have to filter K3,3 graphs. I use here two conditions: the K3,3 graph should have 6 vertices and 3 of them should belong to one set.

cliques_K33 = [c for c in cliques if len(c) == 6 and len(X.intersection(c)) == 3]
print(cliques_K33)

Prints:

[[13, 0, 4, 6, 15, 8]]

Finally we draw a subgraph of the graph B induced by vertices from the found K3,3 clique:

draw_bipart(B.subgraph(cliques_K33[0]), X, Y)

Complete bipartite graph K<sub>3,3</sub>

like image 25
Mykola Zotko Avatar answered Oct 10 '22 15:10

Mykola Zotko