Python Pandas - something like ISIN but "contains" vs "exact" match

I am using Python Pandas to work with two dataframes. The first dataframe contains records from a customer database (First Name, Last Name, Email, etc). The second dataframe contains a list of domain names, e.g. gmail.com, hotmail.com, etc.

I am trying to exclude records from the customer dataframe when the email address contains a domain name from the second list. Said another way, I need to remove a customer when their email address domain appears on a domain blacklist.

Here are sample dataframes:

>>> customer = pd.DataFrame({'Email': [
    "bob@example.com", 
    "jim@example.com", 
    "joe@gmail.com"], 'First Name': [
    "Bob", 
    "Jim", 
    "Joe"]})

>>> blacklist = pd.DataFrame({'Domain': ["gmail.com", "outlook.com"]})

>>> customer
             Email First Name
0  bob@example.com        Bob
1  jim@example.com        Jim
2    joe@gmail.com        Joe
>>> blacklist
        Domain
0    gmail.com
1  outlook.com

My desired output would be:

>>> filtered_list = magic_happens_here(customer, blacklist)
>>> filtered_list
             Email First Name
0  bob@example.com        Bob
1  jim@example.com        Jim

What I've tried so far:

  1. To eliminate specific email addresses, in the past I've used df1[~df1['email'].isin(df2['email'])], but that only matches whole values exactly, so it doesn't help with the use case I'm describing here.
  2. I've tried using df.apply, but couldn't get the syntax right and I imagine performance would be terrible with the actual data set. Example: df1['Email'].apply(lambda x: x for i in ['gmail.com', 'outlook.com'] if i in x). Although this seems like it should work, I get TypeError: 'generator' object is not callable.

Remaining questions are:

  1. What's the best approach here?
  2. Why is the generator not callable?
  3. ... and ultimately, how can I exclude customers from a dataframe when an email address domain exists within an exclusion set?
Marco Benvoglio asked Jun 02 '16 16:06


2 Answers

Code -

import pandas as pd


customer = pd.DataFrame({'Email': [
    "bob@example.com",
    "jim@example.com", 
    "joe@gmail.com"], 'First Name': [
    "Bob", 
    "Jim", 
    "Joe"]})

blacklist = pd.DataFrame({'Domain': ["gmail.com", "outlook.com"]})

invalid_emails = tuple(blacklist['Domain'])

df = customer[customer['Email'].apply(lambda s: not s.endswith(invalid_emails))]

print(df)

Output -

             Email First Name
0  bob@example.com        Bob
1  jim@example.com        Jim
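
On the question of why the generator is not callable: as far as I can tell, Python reads lambda x: x for i in [...] if i in x inside the call parentheses as a generator expression that yields lambdas, not as a lambda containing a loop, so apply receives a generator object instead of a function. A minimal sketch in plain Python; the names domains, gen, and not_blacklisted are illustrative only and not from the question:

```python
domains = ['gmail.com', 'outlook.com']

# Parenthesised the way Python reads the original argument:
# a generator expression whose items are lambdas, not a function.
# (The trailing "if i in x" from the original is dropped here, because x
# is not defined at that position.)
gen = ((lambda x: x) for i in domains)

print(callable(gen))  # False: a generator object cannot be called


# What apply needs is one callable that returns True/False per value.
def not_blacklisted(email, blocked=tuple(domains)):
    """Return True when the address does not end with a blocked domain."""
    return not email.endswith(blocked)


print(not_blacklisted("bob@example.com"))  # True
print(not_blacklisted("joe@gmail.com"))    # False
```

A function like not_blacklisted can be passed to apply directly, which is in effect what the lambda in the code above does.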
Vedang Mehta answered Nov 16 '22 17:11


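Try this (invalid_emails is the tuple built in the other answer, tuple(blacklist['Domain'])):

customer[~customer.Email.str.endswith(invalid_emails)]

or

customer[~customer.Email.str.replace(r'^[^@]*\@', '', regex=True).isin(blacklist.Domain)]

On pandas 2.0 and later, str.replace treats the pattern as a literal string unless regex=True is passed. The session below was recorded on an older version, where regex matching was the default.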

In [399]: filtered_list
Out[399]:
             Email First Name
0  bob@example.com        Bob
1  jim@example.com        Jim

Explanation:

In [395]: customer.Email.str.replace(r'^[^@]*\@', '')
Out[395]:
0    example.com
1    example.com
2      gmail.com
Name: Email, dtype: object

In [396]: customer.Email.str.replace(r'^[^@]*\@', '').isin(blacklist.Domain)
Out[396]:
0    False
1    False
2     True
Name: Email, dtype: bool
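
The same domain extraction can probably be done without a regular expression. The following is a sketch and not part of the original answer: it splits each address once at the last '@' and keeps the right-hand part. The variable names domains and filtered_list are assumptions introduced for illustration.

```python
import pandas as pd

customer = pd.DataFrame({
    'Email': ["bob@example.com", "jim@example.com", "joe@gmail.com"],
    'First Name': ["Bob", "Jim", "Joe"],
})
blacklist = pd.DataFrame({'Domain': ["gmail.com", "outlook.com"]})

# Split once from the right at '@' and take the last piece (the domain).
domains = customer['Email'].str.rsplit('@', n=1).str[-1]

# Keep only the rows whose domain is not in the blacklist (exact match).
filtered_list = customer[~domains.isin(blacklist['Domain'])]

print(filtered_list)
```

No timing is claimed for this variant; it would need to be measured against the alternatives on the real data.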

Timings against a 300K-row DataFrame:

In [401]: customer = pd.concat([customer] * 10**5)

In [402]: customer.shape
Out[402]: (300000, 2)

In [420]: %timeit customer[~customer.Email.str.endswith(invalid_emails)]
10 loops, best of 3: 136 ms per loop

In [421]: %timeit customer[customer['Email'].apply(lambda s: not s.endswith(invalid_emails))]
10 loops, best of 3: 151 ms per loop

In [422]: %timeit customer[~customer.Email.str.replace(r'^[^@]*\@', '').isin(blacklist.Domain)]
1 loop, best of 3: 642 ms per loop

Conclusion:

customer[~customer.Email.str.endswith(invalid_emails)] is slightly faster than customer[customer['Email'].apply(lambda s: not s.endswith(invalid_emails))] (136 ms vs. 151 ms), while customer[~customer.Email.str.replace(r'^[^@]*\@', '').isin(blacklist.Domain)] is much slower (642 ms).
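
One caveat about correctness rather than speed: endswith is a suffix test, so it also matches a domain that merely ends with a blacklisted name, whereas the isin variant compares the whole domain. A small plain-Python sketch of the difference, using a made-up address (joe@notgmail.com) and a hypothetical helper has_blocked_domain:

```python
blocked = ("gmail.com", "outlook.com")


def has_blocked_domain(email, blocked_domains=frozenset(blocked)):
    """Compare the part after the last '@' against the set exactly."""
    domain = email.rpartition('@')[2]
    return domain in blocked_domains


email = "joe@notgmail.com"  # made-up address, not from the sample data

print(email.endswith(blocked))              # True: the suffix test flags it
print(has_blocked_domain(email))            # False: the domain is notgmail.com
print(has_blocked_domain("joe@gmail.com"))  # True
```

If only exact domains should be excluded, one option is to prefix each entry with '@' before calling endswith, for example tuple('@' + d for d in blacklist['Domain']), which keeps the faster approach while avoiding this false positive.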

MaxU - stop WAR against UA answered Nov 16 '22 17:11