I need to build a content-gathering program that will simply read numbers on specified web pages and save that data for analysis later. I don't need it to follow links or search for related data, just gather all the data from websites whose content will change daily.
I have very little programming experience, and I am hoping this will be good for learning. Speed is not a huge issue; I estimate that the crawler would at most have to load 4000 pages in a day.
Thanks.
Edit: Is there any way to test ahead of time if the websites from which I am gathering data are protected against crawlers?
Python is regarded as the most commonly used programming language for web scraping. Incidentally, it is also the top programming language for 2021 according to IEEE Spectrum.
Here are the basic steps to build a crawler:
Step 1: Add one or several URLs to the list of URLs to be visited.
Step 2: Pop a link from that list and add it to the list of visited URLs.
Step 3: Fetch the page's content and scrape the data you're interested in with the ScrapingBot API.
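As a minimal sketch of those three steps in Python - assuming plain requests in place of the ScrapingBot API, and a hypothetical seed URL - the loop looks roughly like this:

```python
import requests

to_visit = ["https://example.com/data"]  # Step 1: seed the list of URLs to visit
visited = set()

while to_visit:
    url = to_visit.pop()                  # Step 2: pop a link from the to-visit list...
    if url in visited:
        continue
    visited.add(url)                      # ...and add it to the visited set
    resp = requests.get(url, timeout=10)  # Step 3: fetch the page's content
    print(url, len(resp.text))            # scrape/store the data you care about here
```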
Search engines run crawlers of their own, whose purpose is to index all the pages so that they can appear in search results; such crawlers are written in C++ and make use of internal libraries to keep them efficient.
Short answer: Python! If you're scraping simple websites with simple HTTP requests, Python is your best bet. Libraries such as requests or HTTPX make it very easy to scrape websites that don't require JavaScript to work correctly. Python offers a lot of simple-to-use HTTP clients.
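To illustrate, here's a sketch that fetches one page with requests and pulls every number out of it with a regular expression - the URL is a placeholder, and for real pages you'd likely want an HTML parser like Beautiful Soup instead of a regex:

```python
import re
import requests

# Placeholder URL - substitute one of the pages you actually need.
url = "https://example.com/prices"

resp = requests.get(url, timeout=10)
resp.raise_for_status()  # fail loudly on 4xx/5xx rather than scraping an error page

# Find every integer or decimal number in the raw HTML.
numbers = re.findall(r"\d+(?:\.\d+)?", resp.text)
print(numbers)
```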
Python probably, or Perl.
Perl has the very nice LWP (Library for WWW in Perl); Python has urllib2.
Both are easy scripting languages available on most OSs.
I've done a crawler in Perl quite a few times; it's an evening of work.
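On the Python side, note that urllib2 was the Python 2 module; in Python 3 the same functionality lives in urllib.request. A stdlib-only fetch of one page, with a hypothetical URL, looks roughly like this:

```python
from urllib.request import urlopen

# urllib2's Python 3 successor; no third-party packages needed.
with urlopen("https://example.com/data") as resp:
    html = resp.read().decode("utf-8", errors="replace")

print(html[:200])  # first 200 characters, just to confirm the fetch worked
```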
And no, they can't really protect themselves from crawlers, except by using a CAPTCHA of some sort - everything else is easier to crack than to set up.
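One thing you can check ahead of time, per the question's edit, is the site's robots.txt, which states whether the site asks crawlers to stay away - it's advisory only and says nothing about CAPTCHAs, but it's a start. A sketch using Python's standard urllib.robotparser, with hypothetical URLs:

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://example.com/robots.txt")
rp.read()  # fetch and parse the robots.txt file

# Would a generic crawler ("*") be allowed to fetch this page?
print(rp.can_fetch("*", "https://example.com/data"))
```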
There was a point about Java: Java is fine. It's more verbose and requires some development environment setup, so you wouldn't do it in one evening - probably a week.
For a small task like the one the question author described, that might be overkill.
On the other hand, there are very useful libraries like lint, tagsoup (DOM traversal for random HTML out there), and lucene (full-text indexing and search), so you might want Java for more serious projects.
In this case, I'd recommend the Apache commons-httpclient library for web crawling (or nutch if you're crazy :).
Also: there are off-the-shelf products that monitor changes on specified websites and present them in useful ways, so you might just grab one of those.
The language you are most comfortable with is more than likely the best language to use.
"I have very little programming experience"
You might find that a web crawler is a bit of a baptism of fire, and that you need to build a few other, more trivial applications first to become familiar with your chosen language (and framework, if applicable).
Good luck!