python: data cleaning - detect pattern for fraudulent email addresses

Question

I am cleaning a data set with fraudulent email addresses that I am removing.

I established multiple rules for catching duplicates and fraudulent domains. But there is one screnario, where I can't think of how to code a rule in python to flag them.

So I have for example rules like this:

#delete punction
df['email'].apply(lambda x:''.join([i for i in x if i not in string.punctuation]))    

#flag yopmail
pattern = "yopmail"
match = df['email'].str.contains(pattern)
df['yopmail'] = np.where(match, 'Y', '0')

#flag duplicates
df['duplicate']=df.email.duplicated(keep=False)

This is the data where I can't figure out a rule to catch it. Basically I am looking for a way to flag addresses that start the same way, but then have consecutive numbers in the end.

abc7020@gmail.com
abc7020.1@gmail.com
abc7020.10@gmail.com
abc7020.11@gmail.com
abc7020.12@gmail.com
abc7020.13@gmail.com
abc7020.14@gmail.com
abc7020.15@gmail.com
attn1@gmail.com
attn12@gmail.com
attn123@gmail.com
attn1234@gmail.com
attn12345@gmail.com
attn123456@gmail.com
attn1234567@gmail.com
attn12345678@gmail.com

Blue482 · Accepted Answer

My solution isn't efficient, nor pretty. But check it out and see if it works for you @jeangelj. It definitely works for the examples you provided. Good luck!

import os
from random import shuffle
from difflib import SequenceMatcher

emails = [... ...] # for example the 16 email addresses you gave in your question
shuffle(emails) # everyday i'm shuffling
emails = sorted(emails) # sort that shit!
names = [email.split('@')[0] for email in emails]

T = 0.7 # <- set your string similarity threshold here!!

split_indices=[]
for i in range(1,len(emails)):
    if SequenceMatcher(None, emails[i], emails[i-1]).ratio() < T:
        split_indices.append(i) # we want to remember where dissimilar email address occurs

grouped=[]
for i in split_indices:
    grouped.append(emails[:i])
grouped.append(emails[i:])
# now we have similar email addresses grouped, we want to find the common prefix for each group
prefix_strings=[]
for group in grouped:
    prefix_strings.append(os.path.commonprefix(group))

# finally
ham=[]
spam=[]
true_ids = [names.index(p) for p in prefix_strings]
for i in range(len(emails)):
    if i in true_ids:
        ham.append(emails[i])
    else:
        spam.append(emails[i])

In [30]: ham
Out[30]: ['abc7020@gmail.com', 'attn1@gmail.com']

In [31]: spam
Out[31]: 
['abc7020.10@gmail.com',
 'abc7020.11@gmail.com',
 'abc7020.12@gmail.com',
 'abc7020.13@gmail.com',
 'abc7020.14@gmail.com',
 'abc7020.15@gmail.com',
 'abc7020.1@gmail.com',
 'attn12345678@gmail.com',
 'attn1234567@gmail.com',
 'attn123456@gmail.com',
 'attn12345@gmail.com',
 'attn1234@gmail.com',
 'attn123@gmail.com',
 'attn12@gmail.com']  

# THE TRUTH YALL!

python: data cleaning - detect pattern for fraudulent email addresses

Tags:

python

data-cleaning

jeangelj

1 Answers

Blue482

Recent Activity

Donate For Us

python: data cleaning - detect pattern for fraudulent email addresses

Tags:

python

data-cleaning

jeangelj

1 Answers

Blue482

Related questions

Recent Activity

Donate For Us