I don't have a specific code issue; I'm just not sure how to approach the following problem logistically with the Scrapy framework:
The structure of the data I want to scrape is typically a table row for each item. Straightforward enough, right?
Ultimately I want to scrape the Title, Due Date, and Details for each row. Title and Due Date are immediately available on the page...
BUT the Details themselves aren't in the table; instead, the table has a link to the page containing the details (if that doesn't make sense, here's a table):
|--------------------------------|----------------|
| Title                          | Due Date       |
|--------------------------------|----------------|
| Job Title (Clickable Link)     | 1/1/2012       |
| Other Job (Link)               | 3/2/2012       |
|--------------------------------|----------------|
I'm afraid I still don't know how to logistically pass the item around with callbacks and requests, even after reading through the CrawlSpider section of the Scrapy documentation.
start_urls contains the links from which the spider starts crawling. If you want to crawl recursively, you should use CrawlSpider and define rules for that. See http://doc.scrapy.org/en/latest/topics/spiders.html for an example.
In the callback function, you parse the response (web page) and return either Item objects, Request objects, or an iterable of both. Those Requests will also contain a callback (maybe the same) and will then be downloaded by Scrapy and then their response handled by the specified callback.
Please read the docs first to understand what I'm saying.
The answer:
To scrape additional fields that are on other pages: in a parse method, extract the URL of the page with the additional info, then create and return from that parse method a Request object with that URL, passing the already extracted data via its meta parameter.
How do I merge results from target page to current page in Scrapy?