I have dataframe that has values like those:
  A                    B
["I need avocado"   "something"]
["something"      "I eat margarina"]
And I want to find rows that:
In any column of the row, the column's value is contained in a list. for example, for the list:
["apple","avocado","bannana"]
And only this line should match: ["I need avocado" "something"]
This line doesnt work:
dataFiltered[dataFiltered[col].str.contains(*includeKeywords)]
Returns:
{TypeError}unsupported operand type(s) for &: 'str' and 'int'
What should I do?
df = pd.DataFrame(dict(
    A=['I need avocado', 'something', 'useless', 'nothing'],
    B=['something', 'I eat margarina', 'eat apple', 'more nothing']
))
includeKeywords = ["apple", "avocado", "bannana"]
                A                B
0  I need avocado        something  # True 'avocado' in A
1       something  I eat margarina
2         useless        eat apple  # True 'apple' in B
3         nothing     more nothing
pandas.DataFrame.stack to make df a Series and enable us to use the pandas.Series.str accessor functionspandas.Series.str.contains with '|'.join(includeKeywords)
pandas.Series.any with argument level=0 because we added a level to the index when we stackeddf[df.stack().str.contains('|'.join(includeKeywords)).any(level=0)]
                A          B
0  I need avocado  something
2         useless  eat apple
This produces a regex search string.  In regex, '|' means or.  So for a regex search, this says match 'apple', 'avocado', or 'bannana'
kwstr = '|'.join(includeKeywords)
print(kwstr)
apple|avocado|bannana
Stacking will flatten our DataFrame
df.stack()
0  A     I need avocado
   B          something
1  A          something
   B    I eat margarina
2  A            useless
   B          eat apple
3  A            nothing
   B       more nothing
dtype: object
Fortunately, the pandas.Series.str.contains method can handle regex and it will produce a boolean Series
df.stack().str.contains(kwstr)
0  A     True
   B    False
1  A    False
   B    False
2  A    False
   B     True
3  A    False
   B    False
dtype: bool
At which point we can cleverly use pandas.Series.any by suggesting it only care about level=0
mask = df.stack().str.contains(kwstr).any(level=0)
mask
0     True
1    False
2     True
3    False
dtype: bool
By using level=0 we preserved the original index in the resulting Series. This makes it perfect for filtering df
df[mask]
                A          B
0  I need avocado  something
2         useless  eat apple
                        Take advantage of the any() function and use a list comprenesion in an df.apply()
df = pd.DataFrame(["I need avocado","I eat margarina"])
print(df)
#                  0
# 0   I need avocado
# 1  I eat margarina
includeKeywords = ["apple","avocado","bannana"]
print(df[df.apply(lambda r: any([kw in r[0] for kw in includeKeywords]), axis=1)])
#                 0
# 0  I need avocado
To make this a bit clearer, you basically need to make a mask that returns True/False for each row
mask = [any([kw in r for kw in includeKeywords]) for r in df[0]]
print(mask)
Then you can use that mask to print the selected rows in your DataFrame
# [True, False]
print(df[mask])
#                 0
# 0  I need avocado
I am showing you both ways because while the df.apply() method is handy for a one liner, it is really slow compared to a standard list comprehension. So if you have a small enough set, feel free to use df.apply(). Otherwise, I'd suggest the python comprehension over the pandas method.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With