I have dataframe that has values like those:
A B
["I need avocado" "something"]
["something" "I eat margarina"]
And I want to find rows that:
In any column of the row, the column's value is contained in a list. for example, for the list:
["apple","avocado","bannana"]
And only this line should match: ["I need avocado" "something"]
This line doesnt work:
dataFiltered[dataFiltered[col].str.contains(*includeKeywords)]
Returns:
{TypeError}unsupported operand type(s) for &: 'str' and 'int'
What should I do?
df = pd.DataFrame(dict(
A=['I need avocado', 'something', 'useless', 'nothing'],
B=['something', 'I eat margarina', 'eat apple', 'more nothing']
))
includeKeywords = ["apple", "avocado", "bannana"]
A B
0 I need avocado something # True 'avocado' in A
1 something I eat margarina
2 useless eat apple # True 'apple' in B
3 nothing more nothing
pandas.DataFrame.stack
to make df
a Series
and enable us to use the pandas.Series.str
accessor functionspandas.Series.str.contains
with '|'.join(includeKeywords)
pandas.Series.any
with argument level=0
because we added a level to the index when we stackeddf[df.stack().str.contains('|'.join(includeKeywords)).any(level=0)]
A B
0 I need avocado something
2 useless eat apple
This produces a regex
search string. In regex
, '|'
means or
. So for a regex
search, this says match 'apple'
, 'avocado'
, or 'bannana'
kwstr = '|'.join(includeKeywords)
print(kwstr)
apple|avocado|bannana
Stacking will flatten our DataFrame
df.stack()
0 A I need avocado
B something
1 A something
B I eat margarina
2 A useless
B eat apple
3 A nothing
B more nothing
dtype: object
Fortunately, the pandas.Series.str.contains
method can handle regex
and it will produce a boolean Series
df.stack().str.contains(kwstr)
0 A True
B False
1 A False
B False
2 A False
B True
3 A False
B False
dtype: bool
At which point we can cleverly use pandas.Series.any
by suggesting it only care about level=0
mask = df.stack().str.contains(kwstr).any(level=0)
mask
0 True
1 False
2 True
3 False
dtype: bool
By using level=0
we preserved the original index in the resulting Series
. This makes it perfect for filtering df
df[mask]
A B
0 I need avocado something
2 useless eat apple
Take advantage of the any()
function and use a list comprenesion in an df.apply()
df = pd.DataFrame(["I need avocado","I eat margarina"])
print(df)
# 0
# 0 I need avocado
# 1 I eat margarina
includeKeywords = ["apple","avocado","bannana"]
print(df[df.apply(lambda r: any([kw in r[0] for kw in includeKeywords]), axis=1)])
# 0
# 0 I need avocado
To make this a bit clearer, you basically need to make a mask that returns True/False for each row
mask = [any([kw in r for kw in includeKeywords]) for r in df[0]]
print(mask)
Then you can use that mask to print the selected rows in your DataFrame
# [True, False]
print(df[mask])
# 0
# 0 I need avocado
I am showing you both ways because while the df.apply()
method is handy for a one liner, it is really slow compared to a standard list comprehension. So if you have a small enough set, feel free to use df.apply()
. Otherwise, I'd suggest the python comprehension over the pandas method.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With