I am currently running two distinct scrapy crawlers.
The first retrieves information about a product x, and the second retrieves other information about product x that is found on a URL scraped by the first bot.
My pipeline writes each product's information out to multiple text files: each category of information goes to its own text file, and each product occupies one line in each file.
Each bot on its own maintains information integrity, since all information is parsed one link at a time (hence each text file is aligned line-by-line with the others). However, I understand Scrapy schedules requests dynamically, completing them based on response time rather than their order in the start_urls list. As a result, my second crawler's output does not line up with the text files from the first crawler.
One easy workaround is to scrape a "primary key" (for the MySQL fans) that is found by both bots, which can then be used to align the product information in a table by sorting on the key and hence aligning the data manually.
My current project leaves me in a difficult spot in terms of finding a primary key, however. The second crawler crawls pages with limited unique information, so my only shot at linking its findings back to the first crawler is to use the URL identified by the first crawler and match it to the identical start_url in the second crawler.
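As an aside, once each row carries the shared key, the alignment step itself is straightforward. The sketch below (my own illustration, not the asker's pipeline; the field name "URL" is assumed as the key) joins two crawlers' row dicts on that key rather than relying on crawl order:

```python
def join_on_key(rows_a, rows_b, key="URL"):
    """Merge two lists of row dicts on a common key field.

    Rows whose key appears in only one source are dropped. Field names
    here are placeholders, not the actual project's schema.
    """
    # Index the second crawler's rows by key for O(1) lookup.
    index_b = {row[key]: row for row in rows_b}
    merged = []
    for row_a in rows_a:
        row_b = index_b.get(row_a[key])
        if row_b is not None:
            combined = dict(row_a)
            combined.update(row_b)  # second crawler's fields win on clashes
            merged.append(combined)
    return merged
```

Sorting both files alphabetically by the key would achieve the same alignment; a keyed join just avoids problems when one crawler misses a page.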
Is there a way to assign the start_url being crawled in each invocation of the HtmlXPathSelector to a variable that can then be pushed into the pipeline along with the item/field data scraped from that particular URL (in cases where the URL cannot be found in the page source)?
Here is my code:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from Fleche_Noire.items import FlecheNoireItem
import codecs
class siteSpider(BaseSpider):
    name = "bbs"
    # allowed_domains takes bare domain names, without the http:// scheme
    allowed_domains = ["samplewebsite.abc"]
    start_urls = [
        'http://www.samplewebsite.abc/prod1',
        'http://www.samplewebsite.abc/prod2',
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        items = []
        item = FlecheNoireItem()
        item["brand"] = []
        item["age"] = []
        item["prodcode"] = hxs.select('//h1/text()').extract() or [' ']
        item["description1"] = []
        item["description2"] = []
        item["product"] = []
        item["availability"] = []
        item["price"] = []
        item["URL"] = []
        item["imgurl"] = []
        items.append(item)
        return items
I'd like to be able to store the start_url as an item just like the h1 text found on the page.
Thank you!
You can get it from response.url, or, in the case of redirects, even from response.request.url, meaning:

item["start_url"] = response.request.url