
Anyone know of a good Python-based web crawler that I could use?

I'm half-tempted to write my own, but I don't really have enough time right now. I've seen the Wikipedia list of open source crawlers but I'd prefer something written in Python. I realize that I could probably just use one of the tools on the Wikipedia page and wrap it in Python. I might end up doing that - if anyone has any advice about any of those tools, I'm open to hearing about them. I've used Heritrix via its web interface and I found it to be quite cumbersome. I definitely won't be using a browser API for my upcoming project.

Thanks in advance. Also, this is my first SO question!

asked Jan 07 '09 by Matt



2 Answers

  • Mechanize is my favorite; great high-level browsing capabilities (super-simple form filling and submission).
  • Twill is a simple scripting language built on top of Mechanize.
  • BeautifulSoup + urllib2 also works quite nicely (a minimal sketch follows this list).
  • Scrapy looks like an extremely promising project; it's new.
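
For the BeautifulSoup + urllib2 route, here is a minimal breadth-first crawler sketch; it uses the Python 2-era libraries named in the bullet (on Python 3 the equivalents are urllib.request and bs4), and the crawl() helper, seed URL, and page limit are illustrative assumptions rather than anything from the answer:

    # A minimal sketch of a breadth-first crawler built on urllib2 + BeautifulSoup.
    # Assumptions: BeautifulSoup 3 on Python 2; the crawl() helper, seed URL, and
    # max_pages limit are illustrative, not part of the original answer.
    import urllib2
    import urlparse
    from collections import deque
    from BeautifulSoup import BeautifulSoup

    def crawl(seed, max_pages=50):
        seen = set([seed])      # URLs already queued or fetched
        queue = deque([seed])   # frontier, visited in FIFO (breadth-first) order
        fetched = 0
        while queue and fetched < max_pages:
            url = queue.popleft()
            try:
                html = urllib2.urlopen(url, timeout=10).read()
            except Exception:
                continue        # skip pages that fail to download
            fetched += 1
            soup = BeautifulSoup(html)
            yield url, soup     # hand each parsed page back to the caller
            for tag in soup.findAll('a', href=True):
                link = urlparse.urljoin(url, tag['href'])
                if link.startswith('http') and link not in seen:
                    seen.add(link)
                    queue.append(link)

    if __name__ == '__main__':
        for page_url, page_soup in crawl('http://www.example.com/'):
            print page_url
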
answered Oct 15 '22 by RexE


Use Scrapy.

It is a Twisted-based web crawling framework, still under heavy development, but it already works. It has many goodies:

  • Built-in support for parsing HTML, XML, CSV, and JavaScript
  • A media pipeline for scraping items with images (or any other media) and downloading the image files as well
  • Support for extending Scrapy by plugging in your own functionality using middlewares, extensions, and pipelines
  • A wide range of built-in middlewares and extensions for handling compression, caching, cookies, authentication, user-agent spoofing, robots.txt, statistics, crawl depth restriction, etc.
  • Interactive scraping shell console, very useful for developing and debugging
  • Web management console for monitoring and controlling your bot
  • Telnet console for low-level access to the Scrapy process

Example code to extract information about all torrent files added today on the mininova torrent site, using an XPath selector on the returned HTML:

class Torrent(ScrapedItem):
    pass

class MininovaSpider(CrawlSpider):
    domain_name = 'mininova.org'
    start_urls = ['http://www.mininova.org/today']
    rules = [Rule(RegexLinkExtractor(allow=['/tor/\d+']), 'parse_torrent')]

    def parse_torrent(self, response):
        x = HtmlXPathSelector(response)
        torrent = Torrent()
        torrent.url = response.url
        torrent.name = x.x("//h1/text()").extract()
        torrent.description = x.x("//div[@id='description']").extract()
        torrent.size = x.x("//div[@id='info-left']/p[2]/text()[2]").extract()
        return [torrent]
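
A note on running it: the snippet uses the Scrapy API of that era (ScrapedItem, domain_name, HtmlXPathSelector), where a spider inside a configured project was started from the project's control script, roughly:

    python scrapy-ctl.py crawl mininova.org

Newer Scrapy releases renamed those pieces (Item, name, response.xpath()) and launch spiders with the scrapy crawl command, so treat the example as an illustration of the approach rather than something to copy verbatim against a current install.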
answered Oct 15 '22 by nosklo