I have a Dataframe with 1 column (+the index) containing lists of sublists or elements. I would like to detect common elements in the lists/sublists and group the lists with at least 1 common element in order to have only lists of elements without any common elements. The lists/sublists are currently like this (exemple for 4 rows):
Num_ID
Row1 [['A1','A2','A3'],['A1','B1','B2','C3','D1']]`
Row2 ['A1','E2','E3']
Row3 [['B4','B5','G4'],['B6','B4']]
Row4 ['B4','C9']
n lists with no common elements (example for the first 2):
['A1','A2','A3','B1','B2','C3','D1','E2','E3']
['B4','B5','B6','C9','G4']
Using sets Another approach to find, if two lists have common elements is to use sets. The sets have unordered collection of unique elements. So we convert the lists into sets and then create a new set by combining the given sets. If they have some common elements then the new set will not be empty.
We can also apply the reduce function in python. This function is used to apply a given function passed onto it as argument to all of the list elements mentioned in the sequence passed along. The lambda function finds out the common elements by iterating through each nested list after set is applied to them .
You can use NetworkX
's connected_components
method for this. Here's how I'd approach this adapting this solution:
import networkx as nx
from itertools import combinations, chain
df= pd.DataFrame({'Num_ID':[[['A1','A2','A3'],['A1','B1','B2','C3','D1']],
['A1','E2','E3'],
[['B4','B5','G4'],['B6','B4']],
['B4','C9']]})
Start by flattening the sublists in each list:
L = [[*chain.from_iterable(i)] if isinstance(i[0], list) else i
for i in df.Num_ID.values.tolist()]
[['A1', 'A2', 'A3', 'A1', 'B1', 'B2', 'C3', 'D1'],
['A1', 'E2', 'E3'],
['B4', 'B5', 'G4', 'B6', 'B4'],
['B4', 'C9']]
Given that the lists/sublists have more than 2 elements, you can get all the length 2 combinations from each sublist and use these as the network edges (note that edges can only connect two nodes):
L2_nested = [list(combinations(l,2)) for l in L]
L2 = list(chain.from_iterable(L2_nested))
Generate a graph, and add your list as the graph edges using add_edges_from. Then use connected_components, which will precisely give you a list of sets of the connected components in the graph:
G=nx.Graph()
G.add_edges_from(L2)
list(nx.connected_components(G))
[{'A1', 'A2', 'A3', 'B1', 'B2', 'C3', 'D1', 'E2', 'E3'},
{'B4', 'B5', 'B6', 'C9', 'G4'}]
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With