
Scrapy: Store/scrape current start_url?

Background (can be skipped):

I am currently running two distinct scrapy crawlers.

The 1st retrieves information about a product x and the 2nd retrieves other information about product x that is found on a url scraped by the 1st bot.

My pipeline writes each product's information to several text files, one file per category, so that each product occupies exactly one line in every file.

Within each bot the files stay consistent, since every page is parsed into a single item (hence line n of one text file corresponds to line n of the others). However, I understand that Scrapy fetches pages asynchronously and handles responses as they arrive, not in the order of the start_urls list. Thus, my 2nd crawler's output does not line up with the text files from the 1st crawler.

One easy work-around for this is to scrape a "primary key" (mysql fanboys) variant of information that is found by both bots and can thus assist in aligning product information in a table by sorting the primary keys alphabetically and hence aligning the data manually.

My current project leaves me in a difficult spot in terms of finding a primary key, however. The 2nd crawler crawls websites with limited unique information, and hence my only shot at linking its findings back to the 1st crawler involves using the url identified by the 1st crawler and linking it to its identical start_url in the 2nd crawler.
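To illustrate the alignment I have in mind, here is a minimal sketch, assuming each crawler's output has been loaded as a list of dicts that share a "url" key. The function name join_by_url and the sample field values are hypothetical and only serve the illustration.

```python
def join_by_url(first_rows, second_rows):
    """Merge rows from two crawlers that share the same 'url' value."""
    # Index the 2nd crawler's rows by URL so crawl order no longer matters.
    second_by_url = {row["url"]: row for row in second_rows}
    merged = []
    for row in first_rows:
        match = second_by_url.get(row["url"])
        if match is None:
            continue  # the 2nd crawler produced nothing for this URL
        merged.append({**row, **match})
    return merged


# Hypothetical sample data: the 2nd crawler finished in a different order.
first = [
    {"url": "http://www.samplewebsite.abc/prod1", "brand": "A"},
    {"url": "http://www.samplewebsite.abc/prod2", "brand": "B"},
]
second = [
    {"url": "http://www.samplewebsite.abc/prod2", "prodcode": "P2"},
    {"url": "http://www.samplewebsite.abc/prod1", "prodcode": "P1"},
]
rows = join_by_url(first, second)
```

The merged rows follow the 1st crawler's order, which is why the 2nd crawler needs to emit the URL it started from.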


Problem:

Is there a way to assign the start_url being crawled in each call to parse (that is, each HtmlXPathSelector pass) to a variable that can then be pushed into the pipeline together with the item/field data scraped from that particular url (in instances where the url cannot be found in the page source)?

Here is my code:

    from scrapy.spider import BaseSpider
    from scrapy.selector import HtmlXPathSelector
    from Fleche_Noire.items import FlecheNoireItem
    import codecs

    class siteSpider(BaseSpider):
        name = "bbs"
        # allowed_domains expects bare domain names, not full URLs
        allowed_domains = ["www.samplewebsite.abc"]
        start_urls = [
            'http://www.samplewebsite.abc/prod1',
            'http://www.samplewebsite.abc/prod2',
        ]

        def parse(self, response):
            hxs = HtmlXPathSelector(response)
            items = []
            item = FlecheNoireItem()
            item["brand"] = []
            item["age"] = []
            item["prodcode"] = hxs.select('//h1/text()').extract() or [' '] 
            item["description1"] = []
            item["description2"] = []
            item["product"] = []
            item["availability"] = []
            item["price"] = []
            item["URL"] = []
            item["imgurl"] = []
            items.append(item)
            return items

I'd like to be able to store the start_url in an item field, just like the h1 text found on the page.

Thank you!

smoles asked Dec 07 '22 03:12


1 Answer

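You can get it from response.url or from response.request.url, meaning:

    item["start_url"] = response.request.url

Two caveats. The start_url field has to be declared on FlecheNoireItem first (or reuse the existing item["URL"]). And as far as I know, after a redirect both attributes refer to the final URL rather than the start_urls entry; the original URL is then the first entry of response.meta['redirect_urls'].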
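A minimal sketch of a helper that prefers the pre-redirect URL when one exists. It assumes Scrapy's default RedirectMiddleware is enabled, which records the redirected-from URLs under the redirect_urls key of response.meta. The helper name original_url is hypothetical, and SimpleNamespace only stands in for a real Response so that the snippet runs without Scrapy installed.

```python
from types import SimpleNamespace


def original_url(response):
    """Return the URL the request chain started from (hypothetical helper)."""
    # RedirectMiddleware records each URL it redirected away from in this list.
    redirect_urls = response.meta.get("redirect_urls")
    if redirect_urls:
        return redirect_urls[0]
    # No redirect happened: the response URL is the start URL itself.
    return response.url


# Stand-ins that mimic the two attributes used above (not real Scrapy responses).
direct = SimpleNamespace(url="http://www.samplewebsite.abc/prod1", meta={})
redirected = SimpleNamespace(
    url="http://www.samplewebsite.abc/prod2-new",
    meta={"redirect_urls": ["http://www.samplewebsite.abc/prod2"]},
)

start_for_direct = original_url(direct)
start_for_redirected = original_url(redirected)
```

Inside parse this would be used as item["start_url"] = original_url(response).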
Guy Gavriely answered Dec 31 '22 16:12