Finding email addresses in body using scrapy

I am trying to find all the email addresses on a page using scrapy.

I found an XPath that should return the email addresses, but when I run the code below it doesn't find any (and I know they are there). Instead I get errors like:

File "C:\Anaconda2\lib\site-packages\scrapy\selector\unified.py", line 100, in xpath
    raise ValueError(msg if six.PY3 else msg.encode("unicode_escape"))
ValueError: Invalid XPath: //[-a-zA-Z0-9._]+@[-a-zA-Z0-9_]+.[a-zA-Z0-9_.]+

This is what my code looks like. Can someone tell me what I'm doing wrong?

I've narrowed down the problem to the xpath but cannot figure out how to fix it.

import scrapy
import datetime
from scrapy.spiders import CrawlSpider
from techfinder.items import EmailItem
from scrapy.selector import HtmlXPathSelector


class DetectSpider(scrapy.Spider):
    name = "test"

    alloweddomainfile = open("emaildomains.txt")
    allowed_domains = [domain.strip() for domain in alloweddomainfile.readlines()]
    alloweddomainfile.close()

    starturlfile = open("emailurls.txt")
    start_urls = [url.strip() for url in starturlfile.readlines()]
    starturlfile.close()


    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        emails = hxs.xpath('//[-a-zA-Z0-9._]+@[-a-zA-Z0-9_]+.[a-zA-Z0-9_.]+').extract()
        #[-a-zA-Z0-9._]+@[-a-zA-Z0-9_]+.[a-zA-Z0-9_.]+
        #<a\s+href=\"mailto:([a-zA-Z0-9._@]*)\
        #/^(|(([A-Za-z0-9]+_+)|([A-Za-z0-9]+\-+)|([A-Za-z0-9]+\.+)|([A-Za-z0-9]+\++))*[A-Za-z0-9]+@((\w+\-+)|(\w+\.))*\w{1,63}\.[a-zA-Z]{2,6})$/i



        emailitems = []
        for email in zip(emails):
            emailitem = EmailItem()
            emailitem["email"] = emails
            emailitem["source"] = response.url
            emailitem["datetime"] = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
            emailitems.append(emailitem)
        return emailitems

user1287245 asked Apr 11 '16 19:04


2 Answers

You can run a regular expression search over response.body to find the email addresses.

import re
emails = re.findall(r'[\w\.-]+@[\w\.-]+', response.body)
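
To see what that pattern actually picks up, here is a minimal standalone sketch that runs it over a made-up HTML string instead of a real response. The `find_emails` helper and `sample_html` are hypothetical names used only for illustration. Two things worth knowing: the character class includes `.`, so a sentence-ending period gets swallowed into the match (the sketch strips it), and on Python 3 `response.body` is bytes, so you would pass `response.text` with a str pattern like this one.

```python
import re

# Same pattern as in the answer above.
EMAIL_RE = re.compile(r'[\w\.-]+@[\w\.-]+')


def find_emails(text):
    """Return distinct matches in first-seen order (hypothetical helper)."""
    found = []
    for match in EMAIL_RE.findall(text):
        # The pattern allows '.', so drop a trailing sentence period.
        cleaned = match.rstrip('.')
        if cleaned not in found:
            found.append(cleaned)
    return found


# Made-up markup standing in for the page text.
sample_html = (
    '<p>Contact <a href="mailto:info@example.com">info@example.com</a> '
    'or sales@example.org.</p>'
)

print(find_emails(sample_html))
# ['info@example.com', 'sales@example.org']
```

The mailto link and its visible text both match, which is why the sketch deduplicates.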

Doctor Strange answered Sep 24 '22 23:09


Building on Doctor Strange's answer, you can use Scrapy's built-in regex support on selectors. This is a bit tidier and you don't have to import re.

This line is the problem:

emails = hxs.xpath('//[-a-zA-Z0-9._]+@[-a-zA-Z0-9_]+.[a-zA-Z0-9_.]+').extract() 

You are calling an XPath selector, but the string you passed in is a regex pattern, not a valid XPath expression. Change it to:

emails = hxs.xpath('//body').re(r'([a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+)')

That returns a list of the email addresses found in the body.
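
Once the list comes back, the loop in the question still needs a small fix: `zip(emails)` wraps each address in a 1-tuple, and `emailitem["email"] = emails` stores the whole list on every item instead of the single address. A rough sketch of the intended loop, using plain dicts in place of `EmailItem` so it runs without a Scrapy project (`build_items` is a hypothetical helper, not part of Scrapy):

```python
import datetime


def build_items(emails, source_url):
    """Build one record per distinct address.

    Plain dicts stand in for the question's EmailItem (an assumption
    made so the sketch runs outside a Scrapy project).
    """
    timestamp = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    items = []
    # Iterate over the addresses directly; set() drops duplicates such as
    # an address that appears in both a mailto link and its link text.
    for email in sorted(set(emails)):
        items.append({
            "email": email,         # the single address, not the whole list
            "source": source_url,
            "datetime": timestamp,
        })
    return items


# Made-up input standing in for the list returned by .re(...)
items = build_items(
    ["a@example.com", "b@example.com", "a@example.com"],
    "http://example.com/contact",
)
for item in items:
    print(item["email"], item["source"])
```

In the spider itself you would pass in the list from the `.re(...)` call and `response.url`, and fill an `EmailItem` instead of a dict.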

chasmani answered Sep 26 '22 23:09