Let's say I have a DataFrame in which each row holds a list (or set) of tags, and I want to filter the DataFrame based on whether a certain tag appears in a row's tag list. What is the most idiomatic way to achieve this with pandas?
import pandas as pd
df = pd.DataFrame({
'amount': [15, 20, 40],
'tags': [["Food", "Eating Out"], ["Food", "Groceries"], ["Clothes"]],
'description': ["Garfunkel's", "Tesco", "Hollister"]
})
I have this piece of code that works, but is rather clunky to write:
criterion = lambda row: 'Food' in row['tags']
df[df.apply(criterion, axis=1)]
The result should be:
   amount  description                tags
0      15  Garfunkel's  [Food, Eating Out]
1      20        Tesco   [Food, Groceries]
You can apply a lambda to only the relevant column, instead of the whole row:
df[df['tags'].map(lambda tags: 'Food' in tags)]
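Put together with the sample frame from the question, a minimal runnable sketch of that column-level filter:

```python
import pandas as pd

df = pd.DataFrame({
    'amount': [15, 20, 40],
    'tags': [["Food", "Eating Out"], ["Food", "Groceries"], ["Clothes"]],
    'description': ["Garfunkel's", "Tesco", "Hollister"]
})

# Map a membership test over just the 'tags' column to build a boolean mask
mask = df['tags'].map(lambda tags: 'Food' in tags)
food = df[mask]
print(food['description'].tolist())
```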
For efficiency, scanning a list of string tags on every logical-indexing operation is slow. Instead, expand df['tags'] into multiple columns. Either:
if there are at most T distinct tags, add T boolean columns, e.g. df['tFood'] = ['Food' in tt for tt in df['tags']]
if each item can have at most N tags and N is small, add string columns tag1, tag2, ..., tagN. You can also convert those strings to Categoricals, so there is no need to string-match every time.
Now, you can do logical indexing quickly:
df.loc[df['tFood']]
# amount description tags tFood
# 0 15 Garfunkel's [Food, Eating Out] True
# 1 20 Tesco [Food, Groceries] True
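As a runnable sketch of the boolean-column expansion (the tFood naming follows the example above; looping over a fixed set of tags of interest is an assumption for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'amount': [15, 20, 40],
    'tags': [["Food", "Eating Out"], ["Food", "Groceries"], ["Clothes"]],
    'description': ["Garfunkel's", "Tesco", "Hollister"]
})

# One-off expansion: one boolean column per tag of interest
for tag in ['Food', 'Groceries', 'Clothes']:
    df['t' + tag] = [tag in tt for tt in df['tags']]

# Later filters become plain boolean indexing, with no list scans
food_rows = df.loc[df['tFood']]
print(food_rows['description'].tolist())
```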
Try this. It's not a perfect solution, but it works:
print(df[df['tags'].astype(str).str.contains('Food')])
You can even use regular expressions in contains() to match multiple patterns.
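One caveat: astype(str) flattens the whole list into a single string, so contains() matches substrings too. For example a hypothetical tag 'Foodie' (not in the original data, added here only to show the pitfall) would also match 'Food'; a word-boundary regex tightens the match:

```python
import pandas as pd

# 'Foodie' is a made-up tag added to demonstrate the substring pitfall
df = pd.DataFrame({
    'amount': [15, 20, 40, 8],
    'tags': [["Food", "Eating Out"], ["Food", "Groceries"], ["Clothes"], ["Foodie"]],
    'description': ["Garfunkel's", "Tesco", "Hollister", "Blog"]
})

# Naive substring search also hits 'Foodie'
loose = df['tags'].astype(str).str.contains('Food')

# \b word boundaries restrict the match to the whole word 'Food'
strict = df['tags'].astype(str).str.contains(r'\bFood\b')

print(df[strict]['description'].tolist())
```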