
Can't get rid of unwanted stuff while scraping email addresses

I'm trying to capture email addresses from some sites' landing pages using the requests library in combination with the re module. I've used the pattern [\w\.-]+@[\w\.-]+ within the script to capture them.

When I run the script, I do get email addresses. However, I also get some unwanted strings that merely resemble email addresses but in reality are not, and I would like to get rid of them.

import re
import requests

links = (
    'http://www.acupuncturetx.com',
    'http://www.hcmed.org',
    'http://www.drmindyboxer.com',
    'http://wendyrobinweir.com',
)

headers = {"User-Agent":"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36"}

for link in links:
    r = requests.get(link,headers=headers)
    emails = re.findall(r"[\w\.-]+@[\w\.-]+",r.text)
    print(emails)

Current output:

['[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]']
['[email protected]', '[email protected]', '[email protected]', '[email protected]']
['[email protected]']
['[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]']

Expected output:

['[email protected]', '[email protected]', '[email protected]', '[email protected]']
[]
[]
['[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]']

How can I only capture email addresses and get rid of unwanted stuff using regex?

asked Dec 07 '20 by SMTH


3 Answers

Picking up from where you left off, you can use a simple checker to verify whether each match is really a valid email.

So first we define the check function:

def check(email):
    # Anchored pattern: lowercase local part with at most one inner dot or
    # underscore, then a single-label domain and a TLD. Raw string avoids
    # invalid-escape warnings from the backslashes.
    regex = r'^[a-z0-9]+[._]?[a-z0-9]+@\w+\.\w+$'
    return bool(re.match(regex, email))

Then we use it to filter the items in your email list:

for link in links:
    r = requests.get(link, headers=headers)
    emails_list = re.findall(r"[\w\.-]+@[\w\.-]+", r.text)
    emails_list = [email for email in emails_list if check(email)]
    print(emails_list)

Outputs:

['[email protected]', '[email protected]', '[email protected]', '[email protected]']
[]
[]
['[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]']
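For reference, the checker can also be exercised on its own. The sample addresses below are made up for illustration, not taken from the scraped pages, and the pattern is equivalent to the one above:

```python
import re

def check(email):
    # Anchored pattern: lowercase local part with at most one inner
    # dot/underscore, then a single-label domain and a TLD.
    regex = r'^[a-z0-9]+[._]?[a-z0-9]+@\w+\.\w+$'
    return bool(re.match(regex, email))

check("john.doe@example.com")   # True
check("john..doe@example.com")  # False: double dot in the local part
check("user@mail")              # False: no TLD
check("user@mail.co.uk")        # False: multi-label domains are rejected too
```

Note the last case: this checker also drops legitimate multi-label domains such as .co.uk, which may or may not be what you want.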
answered Oct 17 '22 by Arthur Pereira

I happen to have a regex such as this that respects RFC 5321, which will help you weed out a lot of bogus (i.e. non-local) email addresses, but not all. The code can easily be adapted to ignore more stuff should you want to...

For example, email 8b4e078a51d04e0e9efdf470027f0ec1@... does look bogus, but the "local name" part is technically correct as per the RFC... You could add checks on the local-name part (will be match.group(1) in my code snippet below)
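Such a local-part check could look like the following sketch. The 32-hex-character heuristic and the function name are my own assumptions, aimed at tracking-ID-style addresses like the one above:

```python
import re

# Hypothetical heuristic: local parts that are exactly 32 hex characters
# (typical of tracking or session IDs) are treated as bogus.
HEX_LOCAL = re.compile(r"^[0-9a-f]{32}$", re.IGNORECASE)

def plausible_local_part(local_part):
    return not HEX_LOCAL.match(local_part)

plausible_local_part("john.doe")                          # True
plausible_local_part("8b4e078a51d04e0e9efdf470027f0ec1")  # False
```

You would call it on match.group(1) before yielding the email.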

Here's my code tidbit with the RFC-compliant regex in question:

# See https://www.rfc-editor.org/rfc/rfc5321
EMAIL_REGEX = re.compile(r"([\w.~%+-]{1,64})@([\w-]{1,64}\.){1,16}(\w{2,16})", re.IGNORECASE | re.UNICODE)


# Cache this (it doesn't change often), all official top-level domains
TLD_URL = "https://datahub.io/core/top-level-domain-names/r/top-level-domain-names.csv.json"
OFFICIAL_TLD = requests.get(TLD_URL).json()
OFFICIAL_TLD = [x["Domain"].lstrip(".") for x in OFFICIAL_TLD]


def extracted_emails(text):
    for match in EMAIL_REGEX.finditer(text):
        top_level = match.group(3)
        if top_level in OFFICIAL_TLD:
            email = match.group(0)
            # Additionally, the whole domain should be at most 255 characters.
            # Note: match.group(2) only captures the *last* repetition of the
            # repeated group, so measure the domain from the full match instead.
            if len(email) - len(match.group(1)) - 1 < 255:
                yield email


# ... 8< ... stripped unchanged code for brevity

for link in links:
    r = requests.get(link,headers=headers)
    emails = list(extracted_emails(r.text))
    print(emails)

This yields your expected results + the one bogus (but technically correct) 8b4e078a51d04e0e9efdf470027f0ec1@... email.

It uses a regex that strictly complies with RFC 5321, and double-checks the top-level domain against the official list for each substring that looks like a valid email.
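To see the filtering end-to-end without the network call, here is a minimal offline sketch; the tiny hardcoded TLD set and the sample string are stand-ins for the downloaded list and the scraped pages:

```python
import re

EMAIL_REGEX = re.compile(r"([\w.~%+-]{1,64})@([\w-]{1,64}\.){1,16}(\w{2,16})",
                         re.IGNORECASE | re.UNICODE)

# Tiny offline stand-in for the official TLD list, for demo purposes only.
OFFICIAL_TLD = {"com", "org", "net"}

sample = "Contact info@example.com or admin@site.bogustld for details"
emails = [m.group(0) for m in EMAIL_REGEX.finditer(sample)
          if m.group(3) in OFFICIAL_TLD]
# emails == ['info@example.com']
```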

answered Oct 17 '22 by Zoran Simic


Instead of only capturing e-mail addresses, you can test everything you capture with the package validate_email (pip install validate_email) and retain only the valid ones. The code would be some version of the following:

from validate_email import validate_email

# Keep only the addresses the validator accepts (the original version kept
# empty-string placeholders for invalid entries instead of dropping them)
emails = [x for x in list_of_potential_emails if validate_email(x)]

This package can also check with the corresponding mail server whether the e-mail address (or the server) exists.

answered Oct 17 '22 by Bernardo Trindade