
Scrapy - how to identify already scraped urls

I'm using Scrapy to crawl a news website on a daily basis. How do I stop Scrapy from scraping URLs it has already scraped? Also, is there any clear documentation or are there examples for SgmlLinkExtractor?

Avinash asked Oct 06 '10 10:10



1 Answer

You can do this quite easily with the Scrapy middleware snippet located here: http://snipplr.com/view/67018/middleware-to-avoid-revisiting-already-visited-items/

To use it, copy the code from the link into a file in your Scrapy project, then enable it by adding a line to your settings.py. The dotted path has to match where you saved the file, for example project/middlewares/ignore.py:

SPIDER_MIDDLEWARES = { 'project.middlewares.ignore.IgnoreVisitedItems': 560 }

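The number determines where the middleware sits relative to the others. The reasoning behind picking a value is described here: http://doc.scrapy.org/en/latest/topics/downloader-middleware.html (that page covers downloader middlewares, but spider middlewares are ordered by number in the same way).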

Finally, you'll need to modify your items.py so that each item class has the following fields:

visit_id = Field()
visit_status = Field()

And I think that's it. The next time you run your spider, it should automatically start skipping pages it has already visited.
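If it helps to see the idea in isolation, here is a rough, standard-library-only sketch of what "remember a visit id and skip it next time" can look like. It is not the code from the linked snippet and does not use Scrapy's API: `VisitedStore`, `filter_unvisited`, and the JSON file are hypothetical names made up for illustration, and a SHA-1 hash of the URL stands in for a real request fingerprint.

```python
import hashlib
import json
import os


class VisitedStore:
    """Hypothetical helper: keeps visit ids in a JSON file between runs."""

    def __init__(self, path):
        self.path = path
        self.ids = set()
        # Load ids saved by a previous run, if any.
        if os.path.exists(path):
            with open(path, encoding="utf-8") as fh:
                self.ids = set(json.load(fh))

    @staticmethod
    def visit_id(url):
        # Assumption: a hash of the URL is a good enough identifier.
        return hashlib.sha1(url.encode("utf-8")).hexdigest()

    def seen(self, url):
        return self.visit_id(url) in self.ids

    def mark(self, url):
        self.ids.add(self.visit_id(url))

    def save(self):
        with open(self.path, "w", encoding="utf-8") as fh:
            json.dump(sorted(self.ids), fh)


def filter_unvisited(urls, store):
    """Yield only URLs the store has not seen, marking each one as visited."""
    for url in urls:
        if store.seen(url):
            continue
        store.mark(url)
        yield url
```

In a daily crawl you would create the store at startup, pass candidate URLs through `filter_unvisited` before requesting them, and call `save()` when the run finishes. One simplification to be aware of: this sketch marks a URL as visited as soon as it is handed out, not after the page has actually been scraped, so a failed download would not be retried on the next run.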

Good luck!

Jama22 answered Oct 22 '22 20:10