I am using Python Pandas to work with two dataframes. The first dataframe contains records from a customer database (First Name, Last Name, Email, etc). The second dataframe contains a list of domain names, e.g. gmail.com, hotmail.com, etc.
I am trying to exclude records from the customer dataframe when the email address contains a domain name from the second list. Said another way, I need to remove a customer when their email address domain appears on a domain blacklist.
Here are sample dataframes:
>>> customer = pd.DataFrame({'Email': [
"[email protected]",
"[email protected]",
"[email protected]"], 'First Name': [
"Bob",
"Jim",
"Joe"]})
>>> blacklist = pd.DataFrame({'Domain': ["gmail.com", "outlook.com"]})
>>> customer
Email First Name
0 [email protected] Bob
1 [email protected] Jim
2 [email protected] Joe
>>> blacklist
Domain
0 gmail.com
1 outlook.com
My desired output would be:
>>> filtered_list = magic_happens_here(customer, blacklist)
>>> filtered_list
Email First Name
0 [email protected] Bob
1 [email protected] Jim
What I've tried so far:
df1[df1['email'].isin(~df2['email'])
... but doesn't help with the use case I'm describing here obviously.df.apply
, but couldn't get the syntax right and I imagine performance would be terrible with the actual data set. Example: df1['Email'].apply(lambda x: x for i in ['gmail.com', 'outlook.com'] if i in x)
. Although this seems like it should work, I get TypeError: 'generator' object is not callable
. Remaining questions are:
Code -
import pandas as pd
customer = pd.DataFrame({'Email': [
"[email protected]",
"[email protected]",
"[email protected]"], 'First Name': [
"Bob",
"Jim",
"Joe"]})
blacklist = pd.DataFrame({'Domain': ["gmail.com", "outlook.com"]})
invalid_emails = tuple(blacklist['Domain'])
df = customer[customer['Email'].apply(lambda s: not s.endswith(invalid_emails))]
print(df)
Output -
Email First Name
0 [email protected] Bob
1 [email protected] Jim
try this:
customer[~customer.Email.str.endswith(invalid_emails)]
or
customer[~customer.Email.str.replace(r'^[^@]*\@', '').isin(blacklist.Domain)]
In [399]: filtered_list
Out[399]:
Email First Name
0 [email protected] Bob
1 [email protected] Jim
explanation:
In [395]: customer.Email.str.replace(r'^[^@]*\@', '')
Out[395]:
0 example.com
1 example.com
2 gmail.com
Name: Email, dtype: object
In [396]: customer.Email.str.replace(r'^[^@]*\@', '').isin(blacklist.Domain)
Out[396]:
0 False
1 False
2 True
Name: Email, dtype: bool
Timings:: against 300K rows DF:
In [401]: customer = pd.concat([customer] * 10**5)
In [402]: customer.shape
Out[402]: (300000, 2)
In [420]: %timeit customer[~customer.Email.str.endswith(invalid_emails)]
10 loops, best of 3: 136 ms per loop
In [421]: %timeit customer[customer['Email'].apply(lambda s: not s.endswith(invalid_emails))]
10 loops, best of 3: 151 ms per loop
In [422]: %timeit customer[~customer.Email.str.replace(r'^[^@]*\@', '').isin(blacklist.Domain)]
1 loop, best of 3: 642 ms per loop
Conclusion:
customer[~customer.Email.str.endswith(invalid_emails)]
is a bit faster compared to customer[customer['Email'].apply(lambda s: not s.endswith(invalid_emails))]
and customer[~customer.Email.str.replace(r'^[^@]*\@', '').isin(blacklist.Domain)]
is much slower
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With