I have a pandas dataframe that contains a list in column split_categories:
df.head()
album_id categories split_categories
0 66562 480.494 [480, 494]
1 114582 128 [128]
2 4846 5 [5]
3 1709 9 [9]
4 59239 105.104 [105, 104]
I would like to select all the rows where at least one category is in a specific list, [480, 9, 104].
Expected output:
album_id categories split_categories
0 66562 480.494 [480, 494]
3 1709 9 [9]
4 59239 105.104 [105, 104]
I managed to do it using apply:
def match_categories(row):
    selected_categories = [480, 9, 104]
    result = [int(i) for i in row['split_categories'] if i in selected_categories]
    return result

df['matched_categories'] = df.apply(match_categories, axis=1)
But this code runs in production, and this approach takes too long (I run it for multiple columns containing lists).
Is there a way to run something like:
df[~(df['split_categories'].anyvalue.isin([480, 9, 104]))]
Thanks
You can convert each list to a set, intersect it with the set of wanted values, and convert the result to bool:
L = [480, 9, 104]
mask = np.array([bool(set(map(int, x)) & set(L)) for x in df['split_categories']])
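For reference, here is a minimal self-contained sketch of the set-based approach, using the sample rows from the question. The helper name rows_with_any is hypothetical (it is not a pandas function), and the categories column is assumed to hold strings. Building the set once, outside the loop, avoids recreating it for every row.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question (categories assumed to be strings).
df = pd.DataFrame({
    "album_id": [66562, 114582, 4846, 1709, 59239],
    "categories": ["480.494", "128", "5", "9", "105.104"],
    "split_categories": [[480, 494], [128], [5], [9], [105, 104]],
})


def rows_with_any(frame, column, wanted):
    """Return rows of `frame` whose list in `column` shares an item with `wanted`.

    Hypothetical helper for illustration only.
    """
    wanted = set(wanted)  # build the lookup set once
    # isdisjoint() returns False as soon as one common item is found
    mask = np.array([not wanted.isdisjoint(map(int, x)) for x in frame[column]])
    return frame[mask]


result = rows_with_any(df, "split_categories", [480, 9, 104])
print(result)
```

Because the column name is a parameter, the same helper can be called once per list column when several columns need the same filter.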
Or convert the list column to a DataFrame, cast to float and compare with isin:
df1 = pd.DataFrame(df['split_categories'].values.tolist(), index=df.index)
mask = df1.astype(float).isin(L).any(axis=1)
df = df[mask]
print(df)
album_id categories split_categories
0 66562 480.494 [480, 494]
3 1709 9 [9]
4 59239 105.104 [105, 104]
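A further variant, not taken from the answers above, is to flatten the list column with Series.explode (available in pandas 0.25 and later) and then collapse the per-item results back to one boolean per row. This is only a sketch on the question's sample data; the variable names exploded, mask and filtered are chosen here for illustration.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "album_id": [66562, 114582, 4846, 1709, 59239],
    "split_categories": [[480, 494], [128], [5], [9], [105, 104]],
})
L = [480, 9, 104]

# explode() yields one row per list item and repeats the original index label
exploded = df["split_categories"].explode()

# Test every item, then reduce to a single True/False per original row
mask = exploded.isin(L).groupby(level=0).any()

filtered = df[mask]
print(filtered)
```

Whether this is faster than the list comprehension depends on the data, so it is worth timing both on a realistic sample before changing production code.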
You can split the categories string into separate columns, and check if any of the resulting items are contained in [480, 9, 104]:
l = [480, 9, 104]
df[df.categories.str.split('.', expand=True).isin(list(map(str, l))).any(axis=1)]
album_id categories split_categories
0 66562 480.494 [480,494]
3 1709 9.000 [9]
4 59239 105.104 [105,104]
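The string-splitting approach relies on the categories column holding strings; the .str accessor is not available on a float column. A minimal sketch under that assumption, with the intermediate frame shown under the illustrative name parts:

```python
import pandas as pd

# Sample data from the question; categories is assumed to be a string column.
df = pd.DataFrame({
    "album_id": [66562, 114582, 4846, 1709, 59239],
    "categories": ["480.494", "128", "5", "9", "105.104"],
})
wanted = [480, 9, 104]

# One column per dot-separated piece; shorter rows are padded with missing values
parts = df["categories"].str.split(".", expand=True)

# The pieces are strings, so compare against string versions of the wanted values
mask = parts.isin([str(i) for i in wanted]).any(axis=1)

filtered = df[mask]
print(filtered)
```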