I want to build a web crawler based on Scrapy to grab news pictures from several news portal website. I want to this crawler to be:
Run forever
Means it will periodical re-visit some portal pages to get updates.
Schedule priorities.
Give different priorities to different type of URLs.
Multi thread fetch
I've read the Scrapy document but have not found something related to what I listed (maybe I am not careful enough). Is there anyone here know how to do that ? or just give some idea/example about it. Thanks!
You have to run a crawler on the web page using the fetch command in the Scrapy shell. A crawler or spider goes through a webpage downloading its text and metadata.
Here are the basic steps to build a crawler: Step 1: Add one or several URLs to be visited. Step 2: Pop a link from the URLs to be visited and add it to the Visited URLs thread. Step 3: Fetch the page's content and scrape the data you're interested in with the ScrapingBot API.
Scrapy is a framework for the spidering of websites, as such, it is intended to support your criteria but it isn't going to dance for you out of the box; you will probably have to get relatively familiar with the module for some tasks.
Scrapy is a library, not an application. There is a non-trivial amount of work (code) that a user of the module needs to make.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With