
How to check if all the elements of a list are present in a pandas column

I have a dataframe and a list:

import pandas as pd

df = pd.DataFrame({'id':[1,2,3,4,5,6,7,8],
    'char':[['a','b'],['a','b','c'],['a','c'],['b','c'],[],['c','a','d'],['c','d'],['a']]})

names = ['a','c']

I want to keep only the rows where both a and c are present in the char column (order doesn't matter).

Expected Output:

        char  id
1  [a, b, c]   2
2     [a, c]   3
5  [c, a, d]   6

My Efforts

true_indices = []
for idx, row in df.iterrows():
    if all(name in row['char'] for name in names):
        true_indices.append(idx)


ids = df[df.index.isin(true_indices)]

This gives the correct output, but it is too slow for a large dataset, so I am looking for a more efficient solution.

Sociopath asked Apr 18 '19 11:04




2 Answers

You can build a set from the list of names for a faster lookup, and use set.issubset to check if all elements in the set are contained in the column lists:

names = set(['a','c'])
df[df['char'].map(names.issubset)]

   id       char
1   2  [a, b, c]
2   3     [a, c]
5   6  [c, a, d]
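
This works because set.issubset accepts any iterable, so each list in the column can be tested directly without converting it to a set first. A minimal self-contained sketch, reusing the question's data (the variable names mask and result are only illustrative):

```python
import pandas as pd

# Data taken from the question.
df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5, 6, 7, 8],
    'char': [['a', 'b'], ['a', 'b', 'c'], ['a', 'c'], ['b', 'c'], [],
             ['c', 'a', 'd'], ['c', 'd'], ['a']],
})

# Build the set once; its bound issubset method is then called on every list.
names = {'a', 'c'}

# True where every element of names appears in that row's list.
mask = df['char'].map(names.issubset)

result = df[mask]
print(result)
```

Keeping the mask in its own variable makes it easy to inspect which rows matched before filtering.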
yatu answered Nov 04 '22 12:11



Use list comprehension with issubset:

mask = [set(names).issubset(x) for x in df['char']]
df = df[mask]
print (df)
   id       char
1   2  [a, b, c]
2   3     [a, c]
5   6  [c, a, d]

Another solution with Series.map:

df = df[df['char'].map(set(names).issubset)]
print (df)
   id       char
1   2  [a, b, c]
2   3     [a, c]
5   6  [c, a, d]

Performance depends on the number of rows and the number of matched values:

df = pd.concat([df] * 10000, ignore_index=True)

In [270]: %timeit df[df['char'].apply(lambda x: set(names).issubset(x))]
45.9 ms ± 2.26 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [271]: %%timeit
     ...: names = set(['a','c'])
     ...: [names.issubset(set(row)) for _,row in df.char.items()]
     ...: 
46.7 ms ± 5.51 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [272]: %%timeit
     ...: df[[set(names).issubset(x) for x in df['char']]]
     ...: 
45.6 ms ± 1.26 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [273]: %%timeit
     ...: df[df['char'].map(set(names).issubset)]
     ...: 
18.3 ms ± 2.96 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [274]: %%timeit
     ...: n = set(names)
     ...: df[df['char'].map(n.issubset)]
     ...: 
16.6 ms ± 278 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [279]: %%timeit
     ...: names = set(['a','c'])
     ...: m = [names.issubset(i) for i in df.char.values.tolist()]
     ...: 
19.2 ms ± 317 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
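
To repeat a comparison like this outside IPython, something along these lines could be used. It is only a rough sketch: the function names are made up for illustration, the standard-library timeit module stands in for %timeit, and the numbers will differ by machine and pandas version.

```python
import timeit

import pandas as pd

# Same data as the question, repeated 10000 times as in the timings above.
base = pd.DataFrame({
    'id': [1, 2, 3, 4, 5, 6, 7, 8],
    'char': [['a', 'b'], ['a', 'b', 'c'], ['a', 'c'], ['b', 'c'], [],
             ['c', 'a', 'd'], ['c', 'd'], ['a']],
})
df = pd.concat([base] * 10000, ignore_index=True)

names = ['a', 'c']
n = set(names)  # built once, outside the timed functions


def with_apply():
    # Series.apply with a lambda wrapping issubset
    return df[df['char'].apply(lambda x: n.issubset(x))]


def with_listcomp():
    # plain list comprehension producing a boolean list
    return df[[n.issubset(x) for x in df['char']]]


def with_map():
    # Series.map with the bound method, no lambda
    return df[df['char'].map(n.issubset)]


for fn in (with_apply, with_listcomp, with_map):
    # best of 3 repeats, 3 calls each, reported per call
    seconds = min(timeit.repeat(fn, number=3, repeat=3)) / 3
    print(f"{fn.__name__}: {seconds * 1000:.1f} ms per call")
```

All three functions return the same rows, so any difference in the printed times comes from how the per-row check is dispatched.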
jezrael answered Nov 04 '22 12:11
