Language/libraries for downloading & parsing web pages?

What language and libraries are suitable for a script to parse and download small numbers of web resources?

For example, some websites publish pseudo-podcasts, but not as proper RSS feeds; they just publish an MP3 file regularly with a web page containing the playlist. I want to write a script to run regularly and parse the relevant pages for the link and playlist info, download the MP3, and put the playlist in the MP3 tags so it shows up nicely in my iPod. There are a bunch of similar applications that I could write too.

What language would you recommend? I would like the script to run on Windows and MacOS. Here are some alternatives:

JavaScript. Just so I could use jQuery for the parsing. I don't know if jQuery works outside a browser though.
Python. Probably good library support for doing what I want. But I don't love Python syntax.
Ruby. I've done simple stuff (manual parsing) in Ruby before.
Clojure. Because I want to spend a bit of time with it.

What's your favourite language and libraries for doing this? And why? Are there any nice jQuery-like libraries for other languages?

asked Mar 04 '10 00:03

Bennett McElwee

2 Answers

If you want to spend some time with Clojure (a very good idea IMO!), give Enlive a shot. The GitHub description reads

a selector-based (à la CSS) templating and transformation system for Clojure

In addition to being useful for templating, it's a capable web-scraping library; see the initial part of this tutorial for some simple scraping examples. (The third one is the New York Times homepage, so actually not as simple as all that.)

There are other tutorials available on the Web if you look for them; Enlive itself comes with some docs / examples. (Plus the code is < 1000 lines in total and very readable, though I suppose this might be less so for someone new to the language.)
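
For readers comparing languages, here is a minimal sketch of the same scraping idea (pull the MP3 link out of a page) in Python, one of the alternatives listed in the question. It is not Enlive and not taken from the tutorial; it uses only the standard library's `html.parser`, and the names `Mp3LinkParser`, `find_mp3_links`, `SAMPLE_PAGE`, and the `example.com` URL are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class Mp3LinkParser(HTMLParser):
    """Collects absolute URLs of <a href="..."> links that end in .mp3."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url  # used to resolve relative hrefs
        self.mp3_links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and href.lower().endswith(".mp3"):
            self.mp3_links.append(urljoin(self.base_url, href))


def find_mp3_links(html, base_url):
    """Return every .mp3 link found in the given HTML text."""
    parser = Mp3LinkParser(base_url)
    parser.feed(html)
    parser.close()
    return parser.mp3_links


# Hypothetical page content, standing in for a real show page.
SAMPLE_PAGE = """
<html><body>
  <h1>This week's show</h1>
  <a href="/audio/show-42.mp3">Download</a>
  <a href="/about.html">About</a>
</body></html>
"""

if __name__ == "__main__":
    print(find_mp3_links(SAMPLE_PAGE, "http://example.com/shows/"))
```

A real script would fetch the page first (for example with `urllib.request.urlopen`) and would probably need a more forgiving parser for messy HTML, which is the gap that selector-based libraries such as Enlive are meant to fill.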

answered Oct 14 '22 01:10

Michał Marczyk

Here is a Clojure link dump covering Enlive (based on TagSoup) and agents for parallel downloads. (Roundups and link dumps aren't pretty, but I did spend some time searching for different libraries. Spidering and crawling can be very easy or pretty involved, depending on the structure of the sites crawled, whether they serve HTML or XHTML, etc.)

http://blog.bestinclass.dk/index.php/2009/10/functional-social-webscraping/

http://nakkaya.com/2009/12/17/mashups-using-clojure/

http://freegeek.in/blog/2009/10/downloading-a-bunch-of-files-in-parallel-using-clojure-agents/

http://blog.maryrosecook.com/post/46601664/Writing-an-mp3-crawler-in-Clojure

http://gnuvince.wordpress.com/2008/11/18/fetching-web-comics-with-clojure-part-2/

http://htmlparser.sourceforge.net/

http://nakkaya.com/2009/11/23/converting-html-to-compojure-dsl/

http://www.bestinclass.dk/index.php/2009/10/functional-social-webscraping/

Apache HTTP client:

http://github.com/rnewman/clj-apache-http

http://github.com/heyZeus/clj-web-crawler

http://japhr.blogspot.com/2009/01/clojure-http-clientclj.html
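
The parallel-download idea mentioned at the top of this answer can be sketched outside Clojure as well. The version below is an assumption-laden illustration in Python: it swaps Clojure agents for a standard-library thread pool, and the function names `fetch` and `fetch_all` are hypothetical, not taken from any of the linked posts.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def fetch(url, timeout=30):
    """Download one URL and return its body as bytes."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def fetch_all(urls, fetch_one=fetch, max_workers=4):
    """Fetch every URL concurrently.

    Returns a dict mapping each URL to its bytes, or to the exception
    raised while fetching it, so one failure does not abort the rest.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit all downloads up front, then collect them one by one.
        futures = {url: pool.submit(fetch_one, url) for url in urls}
        for url, future in futures.items():
            try:
                results[url] = future.result()
            except Exception as exc:
                results[url] = exc
    return results
```

Passing a different `fetch_one` makes it easy to try the function without touching the network. For a handful of MP3 files a small `max_workers` is plenty, and it keeps the load on the remote site low.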

answered Oct 14 '22 01:10

Gene T
