
Filter Dataframe by using ~isin([list_of_substrings])

Given a dataframe full of emails, I want to filter out rows containing potentially blocked domain names or clearly fake emails. The dataframe below represents an example of my data.

>> print(df)

    email                 number
1   [email protected]       2
2   [email protected]       1
3   [email protected]       5
4   [email protected]       2
5   [email protected]       1

I want to filter by two lists. The first list is fake_lst = ['noemail', 'noaddress', 'fake', ... 'no.email']. The second is the blocklist set obtained with from disposable_email_domains import blocklist, either converted to a list or kept as a set.

When I use df = df[~df['email'].str.contains('noemail')] it works fine and filters out that entry. Yet when I do df = df[~df['email'].str.contains(fake_lst)] I get TypeError: unhashable type: 'list'.

The obvious answer is to use df = df[~df['email'].isin(fake_lst)], as suggested in many other Stack Overflow questions such as "Filter Pandas Dataframe based on List of substrings" or "pandas filtering using isin function", but that ends up having no effect.

I suppose I could use str.contains('string') for each possible list entry, but that is ridiculously cumbersome.

Therefore, I need to filter this dataframe based on the substrings contained in the two lists, so that every row whose email contains any of the unwanted substrings is removed.

In the example above, the dataframe after filtering would be:

>> print(df)

    email                 number
2   [email protected]       1
4   [email protected]       2
5   [email protected]       1
DrakeMurdoch asked Oct 25 '25 06:10


1 Answer

Use Series.isin to check whether each element of the Series is contained in values. Note that isin tests for exact equality rather than substrings, which is why applying it to the full address has no effect. Your fake list contains names without the domain, so you need str.split to drop the domain and compare only the part before the @.

Note: str.contains tests whether a pattern or regex is contained within each string of a Series, so df['email'].str.contains('noemail') works fine, but it does not accept a list.

df[~df['email'].str.split('@').str[0].isin(fake_lst)]


    email                 number
0   [email protected]       2
1   [email protected]       1
3   [email protected]       2
4   [email protected]       1
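
The isin approach only removes rows whose local part equals a list entry exactly. If true substring matching is wanted, as the question describes, one possible route is to join the list into a single regex alternation for str.contains and to check the domain part separately with isin. The following is a minimal sketch under stated assumptions: the sample addresses are invented (the ones in the question are redacted), and blocked_domains is a hypothetical stand-in for the blocklist set from disposable_email_domains.

```python
import re

import pandas as pd

# Invented sample data; not the addresses from the question.
df = pd.DataFrame({
    "email": [
        "noemail@example.com",
        "jane.doe@example.org",
        "fake.user@example.net",
        "bob@example.com",
        "alice@mailinator.test",
    ],
    "number": [2, 1, 5, 2, 1],
})

fake_lst = ["noemail", "noaddress", "fake", "no.email"]

# Hypothetical stand-in for disposable_email_domains.blocklist.
blocked_domains = {"mailinator.test"}

# Escape each substring so that characters such as "." match literally,
# then join everything into one alternation pattern.
pattern = "|".join(re.escape(s) for s in fake_lst)

# True where the address contains any of the substrings (missing values count as False).
has_fake_substring = df["email"].str.contains(pattern, case=False, regex=True, na=False)

# True where the part after the last "@" is exactly one of the blocked domains.
domain = df["email"].str.rsplit("@", n=1).str[-1].str.lower()
has_blocked_domain = domain.isin(blocked_domains)

# Keep only rows that trigger neither check.
filtered = df[~(has_fake_substring | has_blocked_domain)]
print(filtered)
```

Substring matching is deliberately broad: a short entry such as 'fake' would also remove a legitimate address that happens to contain those letters, so the list may need tuning for real data.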
Vaishali answered Oct 26 '25 21:10