
Scrapy: Follow link to get additional Item data?

I don't have a specific code issue; I'm just not sure how to approach the following problem logistically with the Scrapy framework:

The structure of the data I want to scrape is typically a table row for each item. Straightforward enough, right?

Ultimately I want to scrape the Title, Due Date, and Details for each row. Title and Due Date are immediately available on the page...

BUT the Details themselves aren't in the table. Instead, each row links to the page containing the details (if that doesn't make sense, here's a table):

|-------------------------------------------------|
|             Title              |    Due Date    |
|-------------------------------------------------|
| Job Title (Clickable Link)     |    1/1/2012    |
| Other Job (Link)               |    3/2/2012    |
|--------------------------------|----------------|

I'm afraid I still don't know how to logistically pass the item around with callbacks and requests, even after reading through the CrawlSpider section of the Scrapy documentation.

dru asked Feb 17 '12 19:02




1 Answer

Please read the docs first so that what I say makes sense.

The answer:

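To scrape additional fields that live on other pages: in a parse method, extract the URL of the page with the additional info, create a Request object for that URL, and return it from the parse method, passing the already extracted data along via the Request's meta parameter. The callback that handles the second page can then read that data back from response.meta, add the remaining fields, and return the finished item.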

See also the related question: "How do I merge results from target page to current page in Scrapy?"

warvariuc answered Oct 09 '22 09:10
