I am trying to find all the email addresses on a page using Scrapy.
I found an XPath that should return the email addresses, but when I run the code below it doesn't find any of them (and I know they are there). I also get errors like:
File "C:\Anaconda2\lib\site-packages\scrapy\selector\unified.py", line 100, in xpath
    raise ValueError(msg if six.PY3 else msg.encode("unicode_escape"))
ValueError: Invalid XPath: //[-a-zA-Z0-9.]+@[-a-zA-Z0-9]+.[a-zA-Z0-9_.]+
This is what my code looks like. Can someone tell me what I'm doing wrong?
I've narrowed down the problem to the xpath but cannot figure out how to fix it.
import scrapy
import datetime
from scrapy.spiders import CrawlSpider
from techfinder.items import EmailItem
from scrapy.selector import HtmlXPathSelector

class DetectSpider(scrapy.Spider):
    name = "test"
    alloweddomainfile = open("emaildomains.txt")
    allowed_domains = [domain.strip() for domain in alloweddomainfile.readlines()]
    alloweddomainfile.close()
    starturlfile = open("emailurls.txt")
    start_urls = [url.strip() for url in starturlfile.readlines()]
    starturlfile.close()

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        emails = hxs.xpath('//[-a-zA-Z0-9._]+@[-a-zA-Z0-9_]+.[a-zA-Z0-9_.]+').extract()
        #[-a-zA-Z0-9._]+@[-a-zA-Z0-9_]+.[a-zA-Z0-9_.]+
        #<a\s+href=\"mailto:([a-zA-Z0-9._@]*)\
        #/^(|(([A-Za-z0-9]+_+)|([A-Za-z0-9]+\-+)|([A-Za-z0-9]+\.+)|([A-Za-z0-9]+\++))*[A-Za-z0-9]+@((\w+\-+)|(\w+\.))*\w{1,63}\.[a-zA-Z]{2,6})$/i
        emailitems = []
        for email in zip(emails):
            emailitem = EmailItem()
            emailitem["email"] = emails
            emailitem["source"] = response.url
            emailitem["datetime"] = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
            emailitems.append(emailitem)
        return emailitems
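You can run a regular expression search on response.body to find the email addresses (this needs import re at the top of the file). On Python 3, response.body is bytes, so pass response.text instead.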
emails = re.findall(r'[\w\.-]+@[\w\.-]+', response.body)
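To try the regex approach outside a spider, here is a minimal sketch. The helper name find_emails and the sample HTML are made up for illustration, and the pattern is the simplified one from the line above, so it will also match strings that are not valid addresses.

```python
import re

# Simplified pattern from the answer above; permissive by design.
EMAIL_RE = re.compile(r"[\w.-]+@[\w.-]+")


def find_emails(body, encoding="utf-8"):
    """Return unique email-like strings from a page body, in first-seen order.

    find_emails is a hypothetical helper, not part of Scrapy.
    """
    # A bytes body (as response.body is on Python 3) must be decoded first,
    # because a str pattern cannot be applied to bytes.
    if isinstance(body, bytes):
        body = body.decode(encoding, errors="replace")
    seen = []
    for match in EMAIL_RE.findall(body):
        if match not in seen:
            seen.append(match)
    return seen


# Illustrative input only; in a spider this would be the response body.
sample = (
    b'<p>Contact <a href="mailto:jane@example.com">jane@example.com</a> '
    b"or bob.smith@mail.example.org</p>"
)
print(find_emails(sample))
```

Removing duplicates matters here because a mailto: link usually repeats the address once in the href and once in the link text.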
Expanding on Doctor Strange's answer, you can use Scrapy's built-in regex functionality. This way is a bit tidier, and you won't have to import re.
This line is the problem:
emails = hxs.xpath('//[-a-zA-Z0-9._]+@[-a-zA-Z0-9_]+.[a-zA-Z0-9_.]+').extract()
You are calling an XPath selector, but the string you have passed in is a regex pattern, not an XPath expression. Select the body with XPath and apply the regex with .re() instead:
emails = hxs.xpath('//body').re(r'([a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+)')
That will give you a list of the emails in the body.
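One more thing to watch once the list comes back: the loop in the question iterates over zip(emails), which yields one-element tuples, and assigns the whole emails list to every item. A minimal sketch of what the loop probably intends is below. Plain dicts stand in for EmailItem so it runs without the project, and build_email_items and the sample values are hypothetical.

```python
import datetime


def build_email_items(emails, source_url):
    """Build one record per address; dicts stand in for EmailItem here."""
    timestamp = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    items = []
    # Iterate over the list directly rather than over zip(emails).
    for email in emails:
        items.append({
            "email": email,  # the single address, not the whole list
            "source": source_url,
            "datetime": timestamp,
        })
    return items


# Illustrative values only.
items = build_email_items(
    ["jane@example.com", "bob@example.org"],
    "http://example.com/contact",
)
for item in items:
    print(item["email"], item["source"])
```

In the spider, the same loop body would fill an EmailItem and use response.url as the source.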