Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Checking if a list is a subset of another in a pandas Dataframe

Tags:

python

pandas

So, i have this Dataframe with almost 3 thousand rows, that looks something like this:

        CITIES
0       ['A','B']
1       ['A','B','C','D']
2       ['A','B','C']
4       ['X']
5       ['X','Y','Z']
...     ...
2670    ['Y','Z']

I would like to remove from the DF all rows were the 'CITIES' list is contained in another row (the order does not matter), on the example above, i would like to remove 0 and 2, since both are contained in 1, and also remove 4 and 2670, since both are contained, i tried something, it kinda worked, but it was really stupid and took almost 10 minutes to compute, this was it:

indexesToRemove=[]
for index, row in entrada.iterrows():
    citiesListFixed=row['CITIES']
    for index2, row2 in entrada.iloc[index+1:].iterrows():
        citiesListCurrent=row2['CITIES']
        if set(citiesListFixed) <= set(citiesListCurrent):
            indexesToRemove.append(index)
            break

Is there a more efficient way to do this?

like image 599
Levy Barbosa Avatar asked Jan 25 '23 12:01

Levy Barbosa


1 Answers

First create the DataFrame of dummies and then we can use matrix multiplication to see if one of the rows is a complete subset of another row, by checking if the sum of multiplication with another row is equal to the number of elements in that row. (Going to be a memory intensive)

import pandas as pd
import numpy as np

df = pd.DataFrame({'Cities': [['A','B'], ['A','B','C','D'], ['A','B','C'],
                              ['X'], ['X','Y','Z'], ['Y','Z']]})    

arr = pd.get_dummies(df['Cities'].explode()).max(level=0).to_numpy()
#[[1 1 0 0 0 0 0]
# [1 1 1 1 0 0 0]
# [1 1 1 0 0 0 0]
# [0 0 0 0 1 0 0]
# [0 0 0 0 1 1 1]
# [0 0 0 0 0 1 1]]

subsets = np.matmul(arr, arr.T)
np.fill_diagonal(subsets, 0)  # So same row doesn't exclude itself

mask = ~np.equal(subsets, np.sum(arr, 1)).any(0)

df[mask]
#         Cities
#1  [A, B, C, D]
#4     [X, Y, Z]

As it stands if you have two rows which tie for the longest subset, (i.e. two rows with ['A','B','C','D']) both are dropped. If this is not desired you can first drop_duplicates on 'Cities' (will need to covert to a hashable type like frozenset) and then apply the above.

like image 160
ALollz Avatar answered Feb 03 '23 07:02

ALollz