I have a CSV file that contains list of items visited by users, eg:
user_id item_id
370 293
471 380
280 100
280 118
219 118
...
The list is long - 30M rows.
I need to find triplets of items that were visited by three users (ie. all three users visited all three items). Such triplets are rare. Example result that i'm trying to find:
user_id item_id
1 15
1 26
1 31
77 15
77 26
77 31
45 15
45 26
45 31
What is a good way to do this? I can use Pandas or any other library.
You can use transform
with size
and then filter by boolean indexing
:
print (df)
user_id item_id
0 1 15
1 1 26
2 1 31
3 77 15
4 77 26
5 77 31
6 45 15
7 45 26
8 45 31
9 370 293
10 471 380
11 280 100
12 280 118
13 219 118
print (df.groupby('user_id')['item_id'].transform('size'))
0 3
1 3
2 3
3 3
4 3
5 3
6 3
7 3
8 3
9 1
10 1
11 2
12 2
13 1
Name: item_id, dtype: int64
print (df[df.groupby('user_id')['item_id'].transform('size') == 3])
user_id item_id
0 1 15
1 1 26
2 1 31
3 77 15
4 77 26
5 77 31
6 45 15
7 45 26
8 45 31
Solution with filtration
is slowier:
df = df.groupby('user_id').filter(lambda x: len(x.item_id) == 3)
print (df)
user_id item_id
0 1 15
1 1 26
2 1 31
3 77 15
4 77 26
5 77 31
6 45 15
7 45 26
8 45 31
You can solve your problem with the help of graph theory. In your case you are looking for complete bipartite graphs of type K3,3 (the middle one in the picture below).
You can use this answer to find all maximal complete bipartite graphs with the help of maximal cliques and then filter out K3,3 graphs.
I used networkx
package and jupyter notebook
.
import networkx as nx
from networkx.algorithms import bipartite
from itertools import combinations, chain
Let’s create a random bipartite graph (like your CSV data).
B = bipartite.random_graph(n=8, m=8, p=0.5, seed=3)
print(B.edges)
Prints:
[(0, 8), (0, 10), (0, 11), (0, 13), (0, 15), (1, 8), (1, 9), (1, 12), (1, 13), (1, 14), (2, 14), (2, 15), (3, 10), (3, 11), (3, 13), (3, 14), (4, 8), (4, 11), (4, 13), (4, 15), (5, 9), (5, 10), (5, 13), (5, 15), (6, 8), (6, 9), (6, 12), (6, 13), (6, 15), (7, 11), (7, 13)]
I used this answer to create a function to draw nice bipartite graphs.
def draw_bipart(graph, set_X, set_Y):
pos = dict()
pos.update( (n, (1, i)) for i, n in enumerate(set_X) ) # put nodes from X at x=1
pos.update( (n, (2, i)) for i, n in enumerate(set_Y) ) # put nodes from Y at x=2
nx.draw(graph, pos=pos, with_labels=True)
X, Y = bipartite.sets(B) #two sets of data (in your case user_id and item_id)
%matplotlib inline
draw_bipart(B, X, Y)
Let’s create a copy of the graph B and connect all vertices in each set (X and Y) so we can search for cliques.
connect_B = B.copy()
edges_X = combinations(X, 2)
edges_Y = combinations(Y, 2)
connect_B.add_edges_from(chain(edges_X, edges_Y))
draw_bipart(connect_B, X, Y)
Now each pair of vertices from the set X (0, 1, 2, 3, 4, 5, 6, 7, 8, 9) is connected by an edge (not all visible due to the overlapping). The same goes for the Y set.
Now let’s search for maximal cliques:
cliques = list(nx.find_cliques(connect_B))
print(cliques)
Prints all maximal cliques:
[[2, 0, 4, 5, 6, 1, 3, 7], [2, 0, 4, 5, 6, 15], [2, 14, 1, 3], [2, 14, 15], [13, 0, 10, 11, 8, 15], [13, 0, 10, 11, 3], [13, 0, 10, 5, 3], [13, 0, 10, 5, 15], [13, 0, 4, 11, 8, 15], [13, 0, 4, 11, 3, 7], [13, 0, 4, 6, 1, 8], [13, 0, 4, 6, 1, 3, 5, 7], [13, 0, 4, 6, 15, 8], [13, 0, 4, 6, 15, 5], [13, 9, 8, 12, 6, 1], [13, 9, 8, 12, 6, 15], [13, 9, 8, 12, 14, 1], [13, 9, 8, 12, 14, 10, 11, 15], [13, 9, 5, 10, 15], [13, 9, 5, 6, 1], [13, 9, 5, 6, 15], [13, 14, 3, 1], [13, 14, 3, 10, 11]]
Now we have to filter K3,3 graphs. I use here two conditions: the K3,3 graph should have 6 vertices and 3 of them should belong to one set.
cliques_K33 = [c for c in cliques if len(c) == 6 and len(X.intersection(c)) == 3]
print(cliques_K33)
Prints:
[[13, 0, 4, 6, 15, 8]]
Finally we draw a subgraph of the graph B induced by vertices from the found K3,3 clique:
draw_bipart(B.subgraph(cliques_K33[0]), X, Y)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With