
Can't get rid of unwanted stuff while scraping email addresses

I'm trying to capture email addresses from some sites' landing pages using the requests library in combination with the re module. I've used the pattern [\w\.-]+@[\w\.-]+ within the script to capture them.

When I run the script, I do get email addresses. However, I also get some unwanted strings that merely resemble email addresses but in reality are not, and I would like to get rid of them.

import re
import requests

links = (
    'http://www.acupuncturetx.com',
    'http://www.hcmed.org',
    'http://www.drmindyboxer.com',
    'http://wendyrobinweir.com',
)

headers = {"User-Agent":"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36"}

for link in links:
    r = requests.get(link,headers=headers)
    emails = re.findall(r"[\w\.-]+@[\w\.-]+",r.text)
    print(emails)

Current output:

['[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]']
['[email protected]', '[email protected]', '[email protected]', '[email protected]']
['[email protected]']
['[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]']

Expected output:

['[email protected]', '[email protected]', '[email protected]', '[email protected]']
[]
[]
['[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]']

How can I only capture email addresses and get rid of unwanted stuff using regex?

asked Dec 07 '20 by SMTH


3 Answers

Picking up from where you left off, you can use a simple checker to verify whether each match is really a valid email.

So first we define the check function:

def check(email):
    # Anchored pattern: lowercase local part with at most one inner dot or
    # underscore, then a single-label domain and a TLD. Raw string avoids
    # invalid-escape warnings from the backslashes.
    regex = r'^[a-z0-9]+[._]?[a-z0-9]+@\w+\.\w+$'
    return bool(re.match(regex, email))

Then we use it to filter the items in your email list:

for link in links:
    r = requests.get(link, headers=headers)
    emails_list = re.findall(r"[\w\.-]+@[\w\.-]+", r.text)
    emails_list = [email for email in emails_list if check(email)]
    print(emails_list)

Outputs:

['[email protected]', '[email protected]', '[email protected]', '[email protected]']
[]
[]
['[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]']
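For reference, the checker can also be exercised on its own. The sample addresses below are made up for illustration, not taken from the scraped pages, and the pattern is equivalent to the one above:

```python
import re

def check(email):
    # Anchored pattern: lowercase local part with at most one inner
    # dot/underscore, then a single-label domain and a TLD.
    regex = r'^[a-z0-9]+[._]?[a-z0-9]+@\w+\.\w+$'
    return bool(re.match(regex, email))

check("john.doe@example.com")   # True
check("john..doe@example.com")  # False: double dot in the local part
check("user@mail")              # False: no TLD
check("user@mail.co.uk")        # False: multi-label domains are rejected too
```

Note the last case: this checker also drops legitimate multi-label domains such as .co.uk, which may or may not be what you want.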
answered Oct 17 '22 by Arthur Pereira

I happen to have a regex such as this that respects RFC 5321, which will help you weed out a lot of bogus (i.e. non-local) email addresses, but not all. The code can easily be adapted to ignore more stuff should you want to...

For example, email 8b4e078a51d04e0e9efdf470027f0ec1@... does look bogus, but the "local name" part is technically correct as per the RFC... You could add checks on the local-name part (will be match.group(1) in my code snippet below)
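Such a local-part check could look like the following sketch. The 32-hex-character heuristic and the function name are my own assumptions, aimed at tracking-ID-style addresses like the one above:

```python
import re

# Hypothetical heuristic: local parts that are exactly 32 hex characters
# (typical of tracking or session IDs) are treated as bogus.
HEX_LOCAL = re.compile(r"^[0-9a-f]{32}$", re.IGNORECASE)

def plausible_local_part(local_part):
    return not HEX_LOCAL.match(local_part)

plausible_local_part("john.doe")                          # True
plausible_local_part("8b4e078a51d04e0e9efdf470027f0ec1")  # False
```

You would call it on match.group(1) before yielding the email.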

Here's my code tidbit with the RFC-compliant regex in question:

# See https://www.rfc-editor.org/rfc/rfc5321
EMAIL_REGEX = re.compile(r"([\w.~%+-]{1,64})@([\w-]{1,64}\.){1,16}(\w{2,16})", re.IGNORECASE | re.UNICODE)


# Cache this (it doesn't change often), all official top-level domains
TLD_URL = "https://datahub.io/core/top-level-domain-names/r/top-level-domain-names.csv.json"
OFFICIAL_TLD = requests.get(TLD_URL).json()
OFFICIAL_TLD = [x["Domain"].lstrip(".") for x in OFFICIAL_TLD]


def extracted_emails(text):
    for match in EMAIL_REGEX.finditer(text):
        top_level = match.group(3)
        if top_level in OFFICIAL_TLD:
            email = match.group(0)
            # Additionally, the whole domain should be at most 255 characters.
            # Note: match.group(2) only captures the *last* repetition of the
            # repeated group, so measure the domain from the full match instead.
            if len(email) - len(match.group(1)) - 1 < 255:
                yield email


# ... 8< ... stripped unchanged code for brevity

for link in links:
    r = requests.get(link,headers=headers)
    emails = list(extracted_emails(r.text))
    print(emails)

This yields your expected results + the one bogus (but technically correct) 8b4e078a51d04e0e9efdf470027f0ec1@... email.

It uses a regex that strictly complies with RFC 5321, and double-checks the top-level domain against the official list for each substring that looks like a valid email.
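To see the filtering end-to-end without the network call, here is a minimal offline sketch; the tiny hardcoded TLD set and the sample string are stand-ins for the downloaded list and the scraped pages:

```python
import re

EMAIL_REGEX = re.compile(r"([\w.~%+-]{1,64})@([\w-]{1,64}\.){1,16}(\w{2,16})",
                         re.IGNORECASE | re.UNICODE)

# Tiny offline stand-in for the official TLD list, for demo purposes only.
OFFICIAL_TLD = {"com", "org", "net"}

sample = "Contact info@example.com or admin@site.bogustld for details"
emails = [m.group(0) for m in EMAIL_REGEX.finditer(sample)
          if m.group(3) in OFFICIAL_TLD]
# emails == ['info@example.com']
```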

answered Oct 17 '22 by Zoran Simic


Instead of only capturing e-mail addresses, you can test everything you capture with the package validate_email (pip install validate_email) and retain only the valid ones. The code would be some version of the following:

from validate_email import validate_email

# Keep only the addresses the validator accepts (the original version kept
# empty-string placeholders for invalid entries instead of dropping them)
emails = [x for x in list_of_potential_emails if validate_email(x)]

This package can also check with the corresponding mail server whether the e-mail address (or the server) exists.

answered Oct 17 '22 by Bernardo Trindade