
Filter pandas dataframe rows if any value on a list inside the dataframe is in another list

I have a pandas dataframe that contains a list in column split_categories:

df.head()

      album_id categories split_categories
    0    66562    480.494       [480, 494]
    1   114582        128            [128]
    2     4846          5              [5]
    3     1709          9              [9]
    4    59239    105.104       [105, 104]

I would like to select all the rows where at least one category is in a specific list, [480, 9, 104].

Expected output:

  album_id categories split_categories
0    66562    480.494       [480, 494]
3     1709          9              [9]
4    59239    105.104       [105, 104]

I managed to do it using apply:

def match_categories(row):
    selected_categories =  [480, 9, 104]
    result = [int(i) for i in row['split_categories'] if i in selected_categories]
    return result

df['matched_categories'] = df.apply(match_categories, axis=1)

But this code runs in production, and this approach takes too long (I run it for multiple columns containing lists).

Is there a way to run something like:

df[~(df['split_categories'].anyvalue.isin([480, 9, 104]))]

Thanks

Ary Jazz asked Jan 16 '19 09:01



2 Answers

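You can convert each list to a set, take its intersection with the target list, and convert the result to bool:

import numpy as np

L = [480, 9, 104]
# True for rows whose list shares at least one value with L
mask = np.array([bool(set(map(int, x)) & set(L)) for x in df['split_categories']])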
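As a self-contained sketch of the set-based idea, assuming a small made-up frame that mirrors the question's sample (the names `selected` and `filtered` are illustrative, not from the original), the lookup set can be built once and checked with `isdisjoint` rather than building an intersection per row:

```python
import numpy as np
import pandas as pd

# Sample data mirroring the question's df.head() output (assumed values)
df = pd.DataFrame({
    "album_id": [66562, 114582, 4846, 1709, 59239],
    "split_categories": [[480, 494], [128], [5], [9], [105, 104]],
})

# Build the lookup set once instead of once per row
selected = {480, 9, 104}

# A row matches when its list is NOT disjoint from the selected set
mask = np.array(
    [not selected.isdisjoint(map(int, x)) for x in df["split_categories"]]
)

filtered = df[mask]
print(filtered)
```

This is still a Python-level loop, so whether it is fast enough depends on the data; it mainly avoids the per-row overhead of `apply(..., axis=1)`.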

Or convert the list column to a DataFrame, cast it to float, and compare with isin. Shorter lists are padded with missing values, which never match:

df1 = pd.DataFrame(df['split_categories'].values.tolist(), index=df.index)
mask = df1.astype(float).isin(L).any(axis=1)

df = df[mask]
print(df)
  album_id categories split_categories
0    66562    480.494       [480, 494]
3     1709          9              [9]
4    59239    105.104       [105, 104]
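For anyone who wants to try the expanded-DataFrame variant in isolation, here is a minimal sketch. It assumes a made-up frame matching the question's sample; `expanded` and `filtered` are illustrative names only:

```python
import pandas as pd

# Sample data mirroring the question's df.head() output (assumed values)
df = pd.DataFrame({
    "album_id": [66562, 114582, 4846, 1709, 59239],
    "split_categories": [[480, 494], [128], [5], [9], [105, 104]],
})
L = [480, 9, 104]

# One column per list position; shorter lists are padded with NaN
expanded = pd.DataFrame(df["split_categories"].tolist(), index=df.index)

# NaN never matches a value in L, so padded cells are always False
mask = expanded.astype(float).isin(L).any(axis=1)

filtered = df[mask]
print(filtered)
```

The expanded frame has as many columns as the longest list, so this probably suits columns with short lists better than columns with a few very long ones.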
jezrael answered Oct 13 '22 07:10


You can split the categories column on '.', expanding the parts into separate columns, and check whether any part is contained in [480, 9, 104] (this assumes categories holds strings):

l = [480, 9, 104]
df[df.categories.str.split('.', expand=True).isin(list(map(str, l))).any(axis=1)]

   album_id  categories split_categories
0     66562     480.494        [480,494]
3      1709       9.000              [9]
4     59239     105.104        [105,104]
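A runnable sketch of the same string-splitting idea follows. It assumes `categories` is stored as strings such as "480.494", which is an assumption on my part; if the column is numeric, the `.str` accessor will not work without converting it first. The names `parts` and `filtered` are illustrative:

```python
import pandas as pd

# Sample data with categories stored as strings (an assumption)
df = pd.DataFrame({
    "album_id": [66562, 114582, 4846, 1709, 59239],
    "categories": ["480.494", "128", "5", "9", "105.104"],
})
l = [480, 9, 104]

# Split "480.494" into separate string columns; missing parts are left empty
parts = df["categories"].str.split(".", expand=True)

# The parts are strings, so compare against string versions of the targets
mask = parts.isin([str(v) for v in l]).any(axis=1)

filtered = df[mask]
print(filtered)
```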
yatu answered Oct 13 '22 07:10