
Filter pandas DataFrame by membership in set-of-tags

Say I have a DataFrame with a column containing a list or set of tags, and I want to filter the DataFrame based on whether a certain tag is among a row's tags. What is the most idiomatic way to achieve this with pandas?

import pandas as pd

df = pd.DataFrame({
    'amount': [15, 20, 40],
    'tags': [["Food", "Eating Out"], ["Food", "Groceries"], ["Clothes"]],
    'description': ["Garfunkel's", "Tesco", "Hollister"]
})

I have this piece of code that works, but is rather clunky to write:

criterion = lambda row: 'Food' in row['tags']
df[df.apply(criterion, axis=1)]

The result should contain only the rows whose tags include 'Food' (here, the Garfunkel's and Tesco rows).

asked Jan 06 '15 23:01 by passy



3 Answers

You can apply a lambda to only the relevant column, instead of the whole row:

df[df['tags'].map(lambda tags: 'Food' in tags)]
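As a self-contained sketch of the same idea, the mask can be wrapped in a small helper so it is reusable for any tag. The name `has_tag` and its `column` parameter are hypothetical, introduced here only for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    'amount': [15, 20, 40],
    'tags': [["Food", "Eating Out"], ["Food", "Groceries"], ["Clothes"]],
    'description': ["Garfunkel's", "Tesco", "Hollister"]
})

def has_tag(frame, tag, column='tags'):
    """Hypothetical helper: boolean Series, True where `tag` is in the row's tags."""
    return frame[column].map(lambda tags: tag in tags)

# Boolean indexing with the mask keeps only the matching rows.
food = df[has_tag(df, 'Food')]
print(food)
```

Because the mask is an ordinary boolean Series, it can be combined with other conditions using `&` and `|`.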
answered Oct 13 '22 01:10 by Marius


For efficiency, searching a list of string tags every time you want to do logical indexing is a bad idea. Instead:

Expand df['tags'] into multiple columns. Either:

  • if there are at most T distinct tags, add T boolean columns, one per tag, e.g. df['tFood'] = [ 'Food' in tt for tt in df['tags'] ]

  • if each item can have at most N tags and N is small, add N string columns tag1, tag2, ..., tagN. You can then convert those strings to Categoricals, so there is no need to string-match every time.

Now, you can do logical indexing quickly:

df.loc[df['tFood']]
#    amount  description                tags  tFood
# 0      15  Garfunkel's  [Food, Eating Out]   True
# 1      20        Tesco   [Food, Groceries]   True
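The boolean-column option can be generalised to every distinct tag at once. The following is only a sketch, assuming a pandas version that provides `Series.explode` (0.25 or later); the `t` prefix follows the `tFood` naming used above, and the variable names are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    'amount': [15, 20, 40],
    'tags': [["Food", "Eating Out"], ["Food", "Groceries"], ["Clothes"]],
    'description': ["Garfunkel's", "Tesco", "Hollister"]
})

# One row per (original index, tag) pair.
exploded = df['tags'].explode()

# One indicator column per distinct tag, collapsed back to one row per
# original index: True if that row carried the tag at least once.
indicators = pd.get_dummies(exploded).groupby(level=0).any()
indicators = indicators.add_prefix('t')  # 'Food' -> 'tFood'

# Attach the indicator columns; filtering is now a plain column lookup.
wide = df.join(indicators)
food_rows = wide[wide['tFood']]
print(food_rows[['amount', 'description', 'tFood']])
```

The list scan happens once when the columns are built, so repeated filters only touch precomputed boolean columns.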
answered Oct 13 '22 01:10 by smci


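Try this. It's not a perfect solution (it does a substring match on the string form of the list, so 'Food' would also match a tag such as 'Fast Food'), but it works.

print(df[df['tags'].astype(str).str.contains('Food')])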

You can even use regular expressions in contains() to match multiple patterns.
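For instance, an alternation pattern can select rows carrying any of several tags. This is a sketch that assumes the tags contain no quote characters, so each tag appears in the stringified list wrapped in single quotes; matching on those quotes avoids partial-tag hits:

```python
import pandas as pd

df = pd.DataFrame({
    'amount': [15, 20, 40],
    'tags': [["Food", "Eating Out"], ["Food", "Groceries"], ["Clothes"]],
    'description': ["Garfunkel's", "Tesco", "Hollister"]
})

# Each list becomes text such as "['Food', 'Groceries']".
as_text = df['tags'].astype(str)

# Non-capturing group with alternation: match either quoted tag.
mask = as_text.str.contains(r"'(?:Groceries|Clothes)'")
print(df[mask])
```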

answered Oct 12 '22 23:10 by Charan Reddy