I am cleaning a data set with fraudulent email addresses that I am removing.
I established multiple rules for catching duplicates and fraudulent domains. But there is one screnario, where I can't think of how to code a rule in python to flag them.
So I have for example rules like this:
#delete punction
df['email'].apply(lambda x:''.join([i for i in x if i not in string.punctuation]))
#flag yopmail
pattern = "yopmail"
match = df['email'].str.contains(pattern)
df['yopmail'] = np.where(match, 'Y', '0')
#flag duplicates
df['duplicate']=df.email.duplicated(keep=False)
This is the data where I can't figure out a rule to catch it. Basically I am looking for a way to flag addresses that start the same way, but then have consecutive numbers in the end.
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
My solution isn't efficient, nor pretty. But check it out and see if it works for you @jeangelj. It definitely works for the examples you provided. Good luck!
import os
from random import shuffle
from difflib import SequenceMatcher
emails = [... ...] # for example the 16 email addresses you gave in your question
shuffle(emails) # everyday i'm shuffling
emails = sorted(emails) # sort that shit!
names = [email.split('@')[0] for email in emails]
T = 0.7 # <- set your string similarity threshold here!!
split_indices=[]
for i in range(1,len(emails)):
if SequenceMatcher(None, emails[i], emails[i-1]).ratio() < T:
split_indices.append(i) # we want to remember where dissimilar email address occurs
grouped=[]
for i in split_indices:
grouped.append(emails[:i])
grouped.append(emails[i:])
# now we have similar email addresses grouped, we want to find the common prefix for each group
prefix_strings=[]
for group in grouped:
prefix_strings.append(os.path.commonprefix(group))
# finally
ham=[]
spam=[]
true_ids = [names.index(p) for p in prefix_strings]
for i in range(len(emails)):
if i in true_ids:
ham.append(emails[i])
else:
spam.append(emails[i])
In [30]: ham
Out[30]: ['[email protected]', '[email protected]']
In [31]: spam
Out[31]:
['[email protected]',
'[email protected]',
'[email protected]',
'[email protected]',
'[email protected]',
'[email protected]',
'[email protected]',
'[email protected]',
'[email protected]',
'[email protected]',
'[email protected]',
'[email protected]',
'[email protected]',
'[email protected]']
# THE TRUTH YALL!
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With