
Anyone know of a good Python-based web crawler that I could use?

I'm half-tempted to write my own, but I don't really have enough time right now. I've seen the Wikipedia list of open source crawlers but I'd prefer something written in Python. I realize that I could probably just use one of the tools on the Wikipedia page and wrap it in Python. I might end up doing that - if anyone has any advice about any of those tools, I'm open to hearing about them. I've used Heritrix via its web interface and I found it to be quite cumbersome. I definitely won't be using a browser API for my upcoming project.

Thanks in advance. Also, this is my first SO question!

asked Jan 07 '09 by Matt



2 Answers

  • Mechanize is my favorite; great high-level browsing capabilities (super-simple form filling and submission).
  • Twill is a simple scripting language built on top of Mechanize.
  • BeautifulSoup + urllib2 also works quite nicely (a minimal sketch follows this list).
  • Scrapy looks like an extremely promising project; it's new.
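
For the BeautifulSoup + urllib2 route, here is a minimal breadth-first crawler sketch; it uses the Python 2-era libraries named in the bullet (on Python 3 the equivalents are urllib.request and bs4), and the crawl() helper, seed URL, and page limit are illustrative assumptions rather than anything from the answer:

    # A minimal sketch of a breadth-first crawler built on urllib2 + BeautifulSoup.
    # Assumptions: BeautifulSoup 3 on Python 2; the crawl() helper, seed URL, and
    # max_pages limit are illustrative, not part of the original answer.
    import urllib2
    import urlparse
    from collections import deque
    from BeautifulSoup import BeautifulSoup

    def crawl(seed, max_pages=50):
        seen = set([seed])      # URLs already queued or fetched
        queue = deque([seed])   # frontier, visited in FIFO (breadth-first) order
        fetched = 0
        while queue and fetched < max_pages:
            url = queue.popleft()
            try:
                html = urllib2.urlopen(url, timeout=10).read()
            except Exception:
                continue        # skip pages that fail to download
            fetched += 1
            soup = BeautifulSoup(html)
            yield url, soup     # hand each parsed page back to the caller
            for tag in soup.findAll('a', href=True):
                link = urlparse.urljoin(url, tag['href'])
                if link.startswith('http') and link not in seen:
                    seen.add(link)
                    queue.append(link)

    if __name__ == '__main__':
        for page_url, page_soup in crawl('http://www.example.com/'):
            print page_url
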
answered Oct 15 '22 by RexE


Use Scrapy.

It is a Twisted-based web crawling framework, still under heavy development, but it already works. It has many goodies:

  • Built-in support for parsing HTML, XML, CSV, and JavaScript
  • A media pipeline for scraping items with images (or any other media) and downloading the image files as well
  • Support for extending Scrapy by plugging in your own functionality using middlewares, extensions, and pipelines
  • A wide range of built-in middlewares and extensions for handling compression, caching, cookies, authentication, user-agent spoofing, robots.txt, statistics, crawl depth restriction, etc.
  • Interactive scraping shell console, very useful for developing and debugging
  • Web management console for monitoring and controlling your bot
  • Telnet console for low-level access to the Scrapy process

Example code to extract information about all torrent files added today on the mininova torrent site, using an XPath selector on the returned HTML:

class Torrent(ScrapedItem):
    pass

class MininovaSpider(CrawlSpider):
    domain_name = 'mininova.org'
    start_urls = ['http://www.mininova.org/today']
    rules = [Rule(RegexLinkExtractor(allow=['/tor/\d+']), 'parse_torrent')]

    def parse_torrent(self, response):
        x = HtmlXPathSelector(response)
        torrent = Torrent()
        torrent.url = response.url
        torrent.name = x.x("//h1/text()").extract()
        torrent.description = x.x("//div[@id='description']").extract()
        torrent.size = x.x("//div[@id='info-left']/p[2]/text()[2]").extract()
        return [torrent]
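
A note on running it: the snippet uses the Scrapy API of that era (ScrapedItem, domain_name, HtmlXPathSelector), where a spider inside a configured project was started from the project's control script, roughly:

    python scrapy-ctl.py crawl mininova.org

Newer Scrapy releases renamed those pieces (Item, name, response.xpath()) and launch spiders with the scrapy crawl command, so treat the example as an illustration of the approach rather than something to copy verbatim against a current install.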
answered Oct 15 '22 by nosklo