Finding email addresses in body using scrapy

Question

I am trying to find all the email addresses on a page using scrapy.

I found an XPath that should return the email addresses, but when I run the code below it doesn't find any (and I know they are there). I also get errors like:

File "C:\Anaconda2\lib\site-packages\scrapy\selector\unified.py", line 100, in xpath
    raise ValueError(msg if six.PY3 else msg.encode("unicode_escape"))
ValueError: Invalid XPath: //[-a-zA-Z0-9.]+@[-a-zA-Z0-9]+.[a-zA-Z0-9_.]+

This is what my code looks like. Can someone tell me what I'm doing wrong?

I've narrowed down the problem to the xpath but cannot figure out how to fix it.

import scrapy
import datetime
from scrapy.spiders import CrawlSpider
from techfinder.items import EmailItem
from scrapy.selector import HtmlXPathSelector


class DetectSpider(scrapy.Spider):
    name = "test"

    alloweddomainfile = open("emaildomains.txt")
    allowed_domains = [domain.strip() for domain in alloweddomainfile.readlines()]
    alloweddomainfile.close()

    starturlfile = open("emailurls.txt")
    start_urls = [url.strip() for url in starturlfile.readlines()]
    starturlfile.close()


    def parse(self, response):
        hxs = HtmlXPathSelector(response)

        emails = hxs.xpath('//[-a-zA-Z0-9._]+@[-a-zA-Z0-9_]+.[a-zA-Z0-9_.]+').extract()
        #[-a-zA-Z0-9._]+@[-a-zA-Z0-9_]+.[a-zA-Z0-9_.]+
        #<a\s+href=\"mailto:([a-zA-Z0-9._@]*)\
        #/^(|(([A-Za-z0-9]+_+)|([A-Za-z0-9]+\-+)|([A-Za-z0-9]+\.+)|([A-Za-z0-9]+\++))*[A-Za-z0-9]+@((\w+\-+)|(\w+\.))*\w{1,63}\.[a-zA-Z]{2,6})$/i

        emailitems = []
        for email in zip(emails):
            emailitem = EmailItem()
            emailitem["email"] = emails
            emailitem["source"] = response.url
            emailitem["datetime"] = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
            emailitems.append(emailitem)
        return emailitems

Doctor Strange · Accepted Answer

You can run a regular expression search over response.body to find the email addresses:

import re

emails = re.findall(r'[\w\.-]+@[\w\.-]+', response.body)
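As a minimal sketch of this approach, the snippet below runs the same pattern over a decoded page body. The sample HTML and the helper name find_emails are assumptions made for illustration. Inside a spider the input would be response.body on Python 2, or response.text on Python 3, where response.body is bytes and a str pattern raises a TypeError against it.

```python
import re

# Same loose pattern as in the answer: word chars, dots and hyphens around "@".
EMAIL_RE = re.compile(r'[\w\.-]+@[\w\.-]+')


def find_emails(page_text):
    """Return the unique email-like strings in page_text, in first-seen order."""
    seen = []
    for match in EMAIL_RE.findall(page_text):
        if match not in seen:
            seen.append(match)
    return seen


# Hypothetical page body, standing in for the downloaded response.
sample_html = (
    '<body><p>Contact: jane.doe@example.com</p>'
    '<a href="mailto:sales@example.org">Sales</a>'
    '<p>Again: jane.doe@example.com</p></body>'
)

print(find_emails(sample_html))
```

The pattern is deliberately loose, so it can also pick up a trailing dot when an address ends a sentence. A stricter pattern may be worth it if the results feed anything automated.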

chasmani · Answer

Expanding on Doctor Strange's answer, you can use Scrapy's built-in regex support. It is a bit tidier, and you don't have to import re yourself.

This line is the problem:

emails = hxs.xpath('//[-a-zA-Z0-9._]+@[-a-zA-Z0-9_]+.[a-zA-Z0-9_.]+').extract()

You are calling an XPath selector, but what you have passed it is a regex pattern, not an XPath expression. If you change it to:

emails = hxs.xpath('//body').re(r'([a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+)')

That will give you a list of the email addresses found in the body.
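Separately from the XPath error, the loop in the question stores the whole emails list in every item (emailitem["email"] = emails) and wraps each address in a tuple via zip. A minimal sketch of building one record per address follows. Plain dicts stand in for EmailItem, and the helper name build_email_items and the sample URL are assumptions for illustration.

```python
import datetime


def build_email_items(emails, source_url):
    """Build one dict per email address found on a page."""
    timestamp = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    items = []
    for email in emails:  # iterate the list directly; no zip() needed
        items.append({
            "email": email,          # the single address, not the whole list
            "source": source_url,
            "datetime": timestamp,
        })
    return items


# Hypothetical values; in the spider these would come from .re(...) and response.url.
found = ["jane.doe@example.com", "sales@example.org"]
for item in build_email_items(found, "http://example.com/contact"):
    print(item["email"], item["source"])
```

In the spider, the same loop would fill the EmailItem fields instead of dict keys.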

Tags: python, xpath, scrapy

user1287245

