I am new to Scrapy and I am trying to scrape YellowPages for learning purposes. Everything works fine, but I also want the email address. To get it I need to visit the links extracted inside parse and parse each of them with a separate parse_email function, but it does not work.
I mean, I tested the parse_email function on its own and it works, but it does not work from inside the main parse function. I want parse_email to get the source of the linked page, so I am calling it as the callback of a Request, but it only returns links like this: <GET https://www.yellowpages.com/los-angeles-ca/mip/palm-tree-la-7254813?lid=7254813>
where it should return the email. For some reason parse_email is not running and the item just contains the link instead of opening the page.
Here is the code; I have commented the relevant parts:
import scrapy
import requests
from urlparse import urljoin

scrapy.optional_features.remove('boto')


class YellowSpider(scrapy.Spider):
    name = 'yellow spider'
    start_urls = ['https://www.yellowpages.com/search?search_terms=restaurant&geo_location_terms=Los+Angeles%2C+CA']

    def parse(self, response):
        SET_SELECTOR = '.info'
        for brickset in response.css(SET_SELECTOR):
            NAME_SELECTOR = 'h3 a ::text'
            ADDRESS_SELECTOR = '.adr ::text'
            PHONE = '.phone.primary ::text'
            WEBSITE = '.links a ::attr(href)'

            # Getting the link of the page that has the email using this selector
            EMAIL_SELECTOR = 'h3 a ::attr(href)'
            # Extracting the link
            email = brickset.css(EMAIL_SELECTOR).extract_first()
            # Joining it with the base URL to make a complete URL
            url = urljoin(response.url, brickset.css('h3 a ::attr(href)').extract_first())

            yield {
                'name': brickset.css(NAME_SELECTOR).extract_first(),
                'address': brickset.css(ADDRESS_SELECTOR).extract_first(),
                'phone': brickset.css(PHONE).extract_first(),
                'website': brickset.css(WEBSITE).extract_first(),
                # ONLY returning the link of the page, not calling the function
                'email': scrapy.Request(url, callback=self.parse_email),
            }

        NEXT_PAGE_SELECTOR = '.pagination ul a ::attr(href)'
        next_page = response.css(NEXT_PAGE_SELECTOR).extract()[-1]
        if next_page:
            yield scrapy.Request(
                response.urljoin(next_page),
                callback=self.parse
            )

    def parse_email(self, response):
        # XPath for the email address on the nested page
        EMAIL_SELECTOR = '//a[@class="email-business"]/@href'
        # Returning the extracted email. THE XPATH WORKS, I CHECKED, BUT THE FUNCTION IS NOT BEING CALLED FOR SOME REASON
        yield {
            'email': response.xpath(EMAIL_SELECTOR).extract_first().replace('mailto:', '')
        }
I don't know what I am doing wrong.
You are yielding a dict with a Request inside of it. Scrapy won't dispatch it because it doesn't know it's there; Requests don't get dispatched automatically just by being created. You need to yield the actual Request.
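For instance, a minimal sketch of the difference:

# This only stores the Request object inside the item dict; it is never scheduled,
# so parse_email never runs and the exported field is just the Request's repr:
yield {'email': scrapy.Request(url, callback=self.parse_email)}

# This hands the Request to the scheduler, so parse_email is called with the response:
yield scrapy.Request(url, callback=self.parse_email)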
In the parse_email function, in order to "remember" which item each email belongs to, you need to pass the rest of the item data alongside the request. You can do this with the meta argument.
Example, in parse:
yield scrapy.Request(url, callback=self.parse_email, meta={'item': {
    'name': brickset.css(NAME_SELECTOR).extract_first(),
    'address': brickset.css(ADDRESS_SELECTOR).extract_first(),
    'phone': brickset.css(PHONE).extract_first(),
    'website': brickset.css(WEBSITE).extract_first(),
}})
And in parse_email:
item = response.meta['item'] # The item this email belongs to
item['email'] = response.xpath(EMAIL_SELECTOR).extract_first().replace('mailto:', '')
return item
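Putting the two pieces together, here is a minimal sketch of the corrected spider. The selectors and spider name are taken from the question; the detail_url name and the guard for listings without an email link are my additions, and response.urljoin replaces the separate urlparse import:

import scrapy


class YellowSpider(scrapy.Spider):
    name = 'yellow spider'
    start_urls = ['https://www.yellowpages.com/search?search_terms=restaurant&geo_location_terms=Los+Angeles%2C+CA']

    def parse(self, response):
        for brickset in response.css('.info'):
            # Build the item from this listing first...
            item = {
                'name': brickset.css('h3 a ::text').extract_first(),
                'address': brickset.css('.adr ::text').extract_first(),
                'phone': brickset.css('.phone.primary ::text').extract_first(),
                'website': brickset.css('.links a ::attr(href)').extract_first(),
            }
            # ...then yield a Request for the detail page, carrying the item
            # along in meta so parse_email can finish and yield it.
            detail_url = response.urljoin(
                brickset.css('h3 a ::attr(href)').extract_first())
            yield scrapy.Request(detail_url, callback=self.parse_email,
                                 meta={'item': item})

        next_page = response.css('.pagination ul a ::attr(href)').extract()[-1]
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)

    def parse_email(self, response):
        item = response.meta['item']  # The item this email belongs to
        mailto = response.xpath('//a[@class="email-business"]/@href').extract_first()
        # Guard against listings that have no email link at all.
        item['email'] = mailto.replace('mailto:', '') if mailto else None
        yield item

On newer Scrapy versions you could pass the item through cb_kwargs instead of meta, but meta matches the Scrapy version apparently used in the question.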