I have a dataframe and a list:
import pandas as pd
df = pd.DataFrame({'id':[1,2,3,4,5,6,7,8],
                   'char':[['a','b'],['a','b','c'],['a','c'],['b','c'],[],['c','a','d'],['c','d'],['a']]})
names = ['a','c']
I want to get only the rows where both a and c are present in the char column (order doesn't matter here).
Expected Output:
char id
1 [a, b, c] 2
2 [a, c] 3
5 [c, a, d] 6
My Efforts
true_indices = []
for idx, row in df.iterrows():
    if all(name in row['char'] for name in names):
        true_indices.append(idx)
ids = df[df.index.isin(true_indices)]
This gives me the correct output, but it is too slow for a large dataset, so I am looking for a more efficient solution.
You can build a set from the list of names for a faster lookup, and use set.issubset to check if all elements in the set are contained in the column lists:
names = set(['a','c'])
df[df['char'].map(names.issubset)]
id char
1 2 [a, b, c]
2 3 [a, c]
5 6 [c, a, d]
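For a self-contained version of the above, here is a minimal sketch that wraps the mask in a helper. The function name rows_containing_all is hypothetical (it is not part of pandas), and the sketch assumes every cell in the column is a list or other iterable, with no NaN values:

```python
import pandas as pd


def rows_containing_all(frame, column, required):
    """Return rows of `frame` whose list in `column` contains every item in `required`.

    Hypothetical helper for illustration; assumes every cell is iterable (no NaN).
    """
    required = set(required)  # build the set once, not once per row
    # set.issubset accepts any iterable, so the lists need no conversion to sets
    mask = frame[column].map(required.issubset)
    return frame[mask]


df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5, 6, 7, 8],
    'char': [['a', 'b'], ['a', 'b', 'c'], ['a', 'c'], ['b', 'c'], [],
             ['c', 'a', 'd'], ['c', 'd'], ['a']],
})

result = rows_containing_all(df, 'char', ['a', 'c'])
print(result)
```

Building the set outside the per-row call is the point of the design: the bound method required.issubset is created once and reused for every row.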
Use a list comprehension with issubset:
mask = [set(names).issubset(x) for x in df['char']]
df = df[mask]
print (df)
id char
1 2 [a, b, c]
2 3 [a, c]
5 6 [c, a, d]
Another solution with Series.map:
df = df[df['char'].map(set(names).issubset)]
print (df)
id char
1 2 [a, b, c]
2 3 [a, c]
5 6 [c, a, d]
Performance depends on the number of rows and the number of matched values:
df = pd.concat([df] * 10000, ignore_index=True)
In [270]: %timeit df[df['char'].apply(lambda x: set(names).issubset(x))]
45.9 ms ± 2.26 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [271]: %%timeit
...: names = set(['a','c'])
...: [names.issubset(set(row)) for _,row in df.char.items()]
...:
46.7 ms ± 5.51 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [272]: %%timeit
...: df[[set(names).issubset(x) for x in df['char']]]
...:
45.6 ms ± 1.26 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [273]: %%timeit
...: df[df['char'].map(set(names).issubset)]
...:
18.3 ms ± 2.96 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [274]: %%timeit
...: n = set(names)
...: df[df['char'].map(n.issubset)]
...:
16.6 ms ± 278 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [279]: %%timeit
...: names = set(['a','c'])
...: m = [names.issubset(i) for i in df.char.values.tolist()]
...:
19.2 ms ± 317 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
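To rerun a comparison like this outside IPython, something along these lines should work with the standard timeit module. It is only a sketch: the function names with_apply, with_map and with_comprehension are made up for illustration, and the numbers will differ by machine and pandas version, so none are shown here.

```python
import timeit

import pandas as pd

names = ['a', 'c']
small = pd.DataFrame({
    'id': [1, 2, 3, 4, 5, 6, 7, 8],
    'char': [['a', 'b'], ['a', 'b', 'c'], ['a', 'c'], ['b', 'c'], [],
             ['c', 'a', 'd'], ['c', 'd'], ['a']],
})
# Same scaling as in the timings above: 8 rows repeated 10000 times
big = pd.concat([small] * 10000, ignore_index=True)

required = set(names)  # built once and shared by the faster variants


def with_apply():
    # rebuilds the set for every row
    return big['char'].apply(lambda x: set(names).issubset(x))


def with_map():
    # reuses one bound method for every row
    return big['char'].map(required.issubset)


def with_comprehension():
    # plain Python loop over the underlying lists
    return [required.issubset(x) for x in big['char'].tolist()]


# Sanity check: all three variants must agree before timing them
assert with_apply().tolist() == with_map().tolist() == with_comprehension()

for fn in (with_apply, with_map, with_comprehension):
    seconds = timeit.timeit(fn, number=5) / 5
    print(f"{fn.__name__}: {seconds * 1000:.1f} ms per run")
```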