
Python Scrapy Parse extracted link with another function

I am new to Scrapy and I am trying to scrape Yellow Pages for learning purposes. Everything works fine, but I also want the email address. To get it I need to visit the links extracted inside parse and parse them with a separate parse_email function, but it does not work.

I mean, I tested the parse_email function on its own and it works, but it is never called from inside the main parse function. I want parse_email to fetch the source of each link, so I am passing it as the callback of a Request, but the output only contains request objects like <GET https://www.yellowpages.com/los-angeles-ca/mip/palm-tree-la-7254813?lid=7254813> where it should contain the email. For some reason parse_email never runs, and I just get the link back without the page being opened.

Here is the code; I have commented the relevant parts:

import scrapy
import requests
from urlparse import urljoin

scrapy.optional_features.remove('boto')

class YellowSpider(scrapy.Spider):
    name = 'yellow spider'
    start_urls = ['https://www.yellowpages.com/search?search_terms=restaurant&geo_location_terms=Los+Angeles%2C+CA']

    def parse(self, response):
        SET_SELECTOR = '.info'
        for brickset in response.css(SET_SELECTOR):

            NAME_SELECTOR = 'h3 a ::text'
            ADDRESS_SELECTOR = '.adr ::text'
            PHONE = '.phone.primary ::text'
            WEBSITE = '.links a ::attr(href)'


            # Getting the link of the page that has the email using this selector
            EMAIL_SELECTOR = 'h3 a ::attr(href)'

            # extracting the relative link
            email = brickset.css(EMAIL_SELECTOR).extract_first()

            # joining it with the base URL to build the complete URL
            url = urljoin(response.url, brickset.css('h3 a ::attr(href)').extract_first())



            yield {
                'name': brickset.css(NAME_SELECTOR).extract_first(),
                'address': brickset.css(ADDRESS_SELECTOR).extract_first(),
                'phone': brickset.css(PHONE).extract_first(),
                'website': brickset.css(WEBSITE).extract_first(),

                # ONLY returning the Request object for the page, not calling the function

                'email': scrapy.Request(url, callback=self.parse_email),
            }

        NEXT_PAGE_SELECTOR = '.pagination ul a ::attr(href)'
        next_page = response.css(NEXT_PAGE_SELECTOR).extract()[-1]
        if next_page:
            yield scrapy.Request(
                response.urljoin(next_page),
                callback=self.parse
            )

    def parse_email(self, response):

        #xpath for the email address in the nested page

        EMAIL_SELECTOR = '//a[@class="email-business"]/@href'

        # returning the extracted email - the XPath works (I checked), but the function is never called for some reason
        yield {
            'email': response.xpath(EMAIL_SELECTOR).extract_first().replace('mailto:', '')
        }

I don't know what I am doing wrong.


1 Answer

You are yielding a dict with a Request inside of it. Scrapy won't dispatch that request because it doesn't know it's there; requests are not dispatched automatically just by being created. You need to yield the Request itself.

To let the parse_email function "remember" which item each email belongs to, you need to pass the rest of the item's data along with the request. You can do this with the meta argument.

Example:

in parse:

yield scrapy.Request(url, callback=self.parse_email, meta={'item': {
    'name': brickset.css(NAME_SELECTOR).extract_first(),
    'address': brickset.css(ADDRESS_SELECTOR).extract_first(),
    'phone': brickset.css(PHONE).extract_first(),
    'website': brickset.css(WEBSITE).extract_first(),
}})

in parse_email:

item = response.meta['item']  # the item this email belongs to
email = response.xpath(EMAIL_SELECTOR).extract_first()
# guard against listings that expose no email link
item['email'] = email.replace('mailto:', '') if email else None
return item
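
As a side note, on Scrapy 1.7+ you can pass the item to the callback with cb_kwargs instead of meta, which delivers it as a plain keyword argument. A minimal sketch of the same idea, reusing the selectors defined in the question's code:

in parse:

item = {
    'name': brickset.css(NAME_SELECTOR).extract_first(),
    'address': brickset.css(ADDRESS_SELECTOR).extract_first(),
    'phone': brickset.css(PHONE).extract_first(),
    'website': brickset.css(WEBSITE).extract_first(),
}
yield scrapy.Request(url, callback=self.parse_email, cb_kwargs={'item': item})

in parse_email:

def parse_email(self, response, item):
    # item is injected as a keyword argument instead of travelling in response.meta
    email = response.xpath('//a[@class="email-business"]/@href').extract_first()
    item['email'] = email.replace('mailto:', '') if email else None
    yield item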