
python: data cleaning - detect pattern for fraudulent email addresses

I am cleaning a data set with fraudulent email addresses that I am removing.

I established multiple rules for catching duplicates and fraudulent domains, but there is one scenario where I can't think of how to code a rule in Python to flag the addresses.

So I have for example rules like this:

#delete punctuation
df['email'].apply(lambda x:''.join([i for i in x if i not in string.punctuation]))    

#flag yopmail
pattern = "yopmail"
match = df['email'].str.contains(pattern)
df['yopmail'] = np.where(match, 'Y', '0')

#flag duplicates
df['duplicate']=df.email.duplicated(keep=False)

This is the data for which I can't figure out a rule. Basically, I am looking for a way to flag addresses that start the same way but then have consecutive numbers at the end.

[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
jeangelj asked Jan 04 '23 20:01



1 Answer

My solution isn't efficient, nor pretty. But check it out and see if it works for you @jeangelj. It definitely works for the examples you provided. Good luck!

import os
from random import shuffle
from difflib import SequenceMatcher

emails = [... ...] # for example the 16 email addresses you gave in your question
shuffle(emails) # shuffle first to show that the input order doesn't matter
emails = sorted(emails) # sorting puts similar addresses next to each other
names = [email.split('@')[0] for email in emails]

T = 0.7 # <- set your string similarity threshold here!!

split_indices=[]
for i in range(1,len(emails)):
    if SequenceMatcher(None, emails[i], emails[i-1]).ratio() < T:
        split_indices.append(i) # we want to remember where dissimilar email address occurs

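grouped=[]
start = 0
for i in split_indices:
    grouped.append(emails[start:i]) # slice from the previous split point, not from the beginning
    start = i
grouped.append(emails[start:]) # the last group runs to the end of the list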
# now we have similar email addresses grouped, we want to find the common prefix for each group
prefix_strings=[]
for group in grouped:
    prefix_strings.append(os.path.commonprefix(group))

# finally
ham=[]
spam=[]
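# take the local part of each prefix, so a group with a single address still matches its own name
prefix_names = [p.split('@')[0] for p in prefix_strings]
true_ids = [names.index(n) for n in prefix_names if n in names]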
for i in range(len(emails)):
    if i in true_ids:
        ham.append(emails[i])
    else:
        spam.append(emails[i])

In [30]: ham
Out[30]: ['[email protected]', '[email protected]']

In [31]: spam
Out[31]: 
['[email protected]',
 '[email protected]',
 '[email protected]',
 '[email protected]',
 '[email protected]',
 '[email protected]',
 '[email protected]',
 '[email protected]',
 '[email protected]',
 '[email protected]',
 '[email protected]',
 '[email protected]',
 '[email protected]',
 '[email protected]']  

# THE TRUTH YALL!
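
If the similarity threshold is hard to tune, a more direct rule for "same start, numbers at the end" might be to strip the trailing digits from the part before the @ and count how often each stem occurs per domain. The following is only a sketch of that idea, not the method used above: the function name flag_numbered_variants, the regular expression, and the sample addresses are all made up for illustration, and it does not check that the numbers are strictly consecutive.

```python
import re
from collections import Counter

# Captures: local part without trailing digits, the trailing digits, the domain.
ADDRESS = re.compile(r"^(.*?)(\d*)@(.+)$")

def flag_numbered_variants(emails):
    """Return one bool per address: True if the local part ends in digits
    and at least one other address shares the same stem and domain."""
    parsed = [ADDRESS.match(e) for e in emails]
    # (stem, domain) identifies a family of addresses; None if the address did not parse.
    keys = [(m.group(1), m.group(3)) if m else None for m in parsed]
    counts = Counter(k for k in keys if k is not None)
    return [
        m is not None and m.group(2) != "" and counts[k] > 1
        for m, k in zip(parsed, keys)
    ]

# Hypothetical sample data, not the addresses from the question.
sample = [
    "alice@example.com",
    "alice1@example.com",
    "alice2@example.com",
    "alice3@example.com",
    "bob42@example.org",
    "carol@example.net",
]
flags = flag_numbered_variants(sample)
```

With a DataFrame like the one in the question, the result could be stored as a new column, for example df['numbered'] = flag_numbered_variants(df['email']). The unnumbered base address is deliberately left unflagged, and a lone numbered address such as the bob42 one is not flagged either, because nothing else shares its stem.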
Blue482 answered Jan 14 '23 02:01
