
What is the ideal programming language for a web-crawler?

Tags:

web-crawler

I need to build a content gathering program that will simply read numbers on specified web pages, and save that data for analysis later. I don't need it to search for links or related data, just gather all data from websites that will have changing content daily.

I have very little programming experience, and I am hoping this will be good for learning. Speed is not a huge issue, I estimate that the crawler would at most have to load 4000 pages in a day.

Thanks.

Edit: Is there any way to test ahead of time if the websites from which I am gathering data are protected against crawlers?

Asked by Alex on Jun 10 '09

People also ask

Which language is best for web crawling?

Python is regarded as the most commonly used programming language for web scraping. Incidentally, it is also the top programming language for 2021 according to IEEE Spectrum.

How do you code a web crawler?

Here are the basic steps to build a crawler:
Step 1: Add one or several URLs to be visited.
Step 2: Pop a link from the URLs to be visited and add it to the visited-URLs list.
Step 3: Fetch the page's content and scrape the data you're interested in with the ScrapingBot API.
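Those three steps can be sketched in a few lines of Python. The `fetch` callable here stands in for whatever download mechanism you use (the ScrapingBot API, `urllib`, etc.); all the names and URLs below are illustrative, not part of any real API:

```python
from collections import deque

def crawl(seed_urls, fetch, max_pages=10):
    """Visit URLs breadth-first: pop a link, record it, fetch its content."""
    to_visit = deque(seed_urls)    # Step 1: URLs to be visited
    visited = set()
    results = {}
    while to_visit and len(visited) < max_pages:
        url = to_visit.popleft()   # Step 2: pop a link...
        if url in visited:
            continue
        visited.add(url)           # ...and add it to the visited-URLs list
        results[url] = fetch(url)  # Step 3: fetch and scrape
    return results

# Demo with a stand-in fetcher; a real one would do an HTTP request.
fake_pages = {"http://a.example": "42", "http://b.example": "7"}
data = crawl(fake_pages.keys(), fake_pages.get)
print(data)
```

Since the asker doesn't need to follow links, nothing is ever appended back onto `to_visit`; a crawler that discovers links would push them there in step 3.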

What language is Google crawler written in?

Google's crawler exists to index pages so that they can appear in the search engine results. The crawler tool is written in C++ and makes use of internal libraries to keep it efficient.

Is Python or Java better for web scraping?

Short answer: Python! If you're scraping simple websites with simple HTTP requests, Python is your best bet. Libraries such as requests or HTTPX make it very easy to scrape websites that don't require JavaScript to work correctly, and Python offers a lot of simple-to-use HTTP clients.


2 Answers

Python probably, or Perl.

Perl has the very nice LWP (Library for WWW in Perl); Python has urllib2 (urllib.request in Python 3).

Both are easy scripting languages available on most OSs.
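For the asker's concrete task - reading numbers off a page and saving them for later - either language needs little more than a fetch plus a pattern match. A minimal Python sketch, using the standard library's urllib.request (the Python 3 successor to urllib2); the regex and the sample HTML are illustrative assumptions, not from the original question:

```python
import re
import urllib.request

def fetch(url):
    # Download a page's HTML with the standard library; the third-party
    # `requests` package is a popular alternative.
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace")

def extract_numbers(html):
    # Pull every integer or decimal number out of the page text.
    return [float(m) for m in re.findall(r"-?\d+(?:\.\d+)?", html)]

sample = "<td>Price: 19.99</td><td>Stock: 42</td>"
print(extract_numbers(sample))  # [19.99, 42.0]
```

At 4000 pages a day, a plain sequential loop like this is more than fast enough; there's no need for threading or async machinery.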

I've done a crawler in Perl quite a few times; it's an evening of work.

And no, they can't really protect themselves from crawlers, except by using a CAPTCHA of sorts - everything else is easier to crack than to set up.
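On the asker's edit: while sites can't reliably block a crawler, most publish a robots.txt saying what they'd rather you not fetch, and you can check it ahead of time. Python's standard-library urllib.robotparser does this; the sketch below parses a sample robots.txt inline to stay offline, and the URL and user-agent names are made up for illustration:

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
# Against a live site you would instead do:
#   rp.set_url("https://example.com/robots.txt"); rp.read()
rp.parse("""\
User-agent: *
Disallow: /private/
""".splitlines())

# can_fetch(user_agent, url) says whether that agent may fetch that URL.
print(rp.can_fetch("MyCrawler", "https://example.com/data.html"))
print(rp.can_fetch("MyCrawler", "https://example.com/private/x"))
```

Honoring robots.txt is a convention, not a technical barrier, but checking it is the polite way to find out whether a site wants to be crawled.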

There was a point about Java: Java is fine. It's more verbose and requires some development-environment setup, so you wouldn't do it in one evening - probably a week. For the small task the question's author indicated, that might be overkill. On the other hand, there are very useful libraries like lint, TagSoup (DOM traversal for the random HTML out there) and Lucene (full-text indexing and search), so you might want Java for more serious projects. In that case, I'd recommend the Apache Commons HttpClient library for web-crawling (or Nutch if you're crazy :).

Also: there are off-the-shelf products that monitor changes on specified websites and present them in useful ways, so you might just grab one of those.

Answered by alamar on Dec 01 '22


The language you are most comfortable with is more than likely the best language to use.

"I have very little programming experience"

You might find that a web crawler is a bit of a baptism of fire and you need to build a few other more trivial applications to become familiar with your chosen language (and framework if applicable).

Good luck!

Answered by Greg B on Dec 01 '22