Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Querying a pandas dataframe column which has values as list

I have a dataframe as given below.

import pandas as pd


raw_data = {'score': [1,2,3], 
        'tags': [['apple','pear','guava'],['truck','car','plane'],['cat','dog','mouse']]}


df = pd.DataFrame(raw_data, columns = ['score', 'tags'])

df.query("score==1") gives the first row as result.

But df.query("tags='apple'") gives error.

How to write query for the column 'tags'.

like image 705
Vishnu.P.S Avatar asked Nov 08 '22 05:11

Vishnu.P.S


1 Answers

You cannot use pd.DataFrame.query to test membership of a string in lists within a series of lists. Holding lists in Pandas dataframes is not recommended as you lose vectorised functionality.

With your existing dataframe, you can instead calculate a mask using pd.Series.apply:

res = df[df['tags'].apply(lambda x: 'apple' in x)]

print(res)

   score                  tags
0      1  [apple, pear, guava]

Or you can use a list comprehension:

res = df[['apple' in x for x in df['tags']]]

A third option is to use set:

res = df[df['tags'].apply(set) >= {'apple'}]

The last option, although expensive, may suit when you are testing for existence of multiple tags. In each case, we are constructing a Boolean series, which we then use to mask the dataframe.

like image 151
jpp Avatar answered Nov 14 '22 21:11

jpp