I need to build a content-gathering program that will simply read numbers on specified web pages and save that data for analysis later. I don't need it to follow links or search for related data, just gather all the data from websites whose content will change daily.
I have very little programming experience, and I am hoping this will be good for learning. Speed is not a huge issue; I estimate that the crawler would at most have to load 4000 pages in a day.
Thanks.
Edit: Is there any way to test ahead of time if the websites from which I am gathering data are protected against crawlers?
Python is regarded as the most commonly used programming language for web scraping. Incidentally, it is also the top programming language for 2021 according to IEEE Spectrum.
Here are the basic steps to build a crawler:
Step 1: Add one or several URLs to the list of URLs to be visited.
Step 2: Pop a link from that list and add it to the list of visited URLs.
Step 3: Fetch the page's content and scrape the data you're interested in with the ScrapingBot API.
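As a minimal sketch of those three steps in Python - assuming plain requests in place of the ScrapingBot API, and a hypothetical seed URL - the loop looks roughly like this:

```python
import requests

to_visit = ["https://example.com/data"]  # Step 1: seed the list of URLs to visit
visited = set()

while to_visit:
    url = to_visit.pop()                  # Step 2: pop a link from the to-visit list...
    if url in visited:
        continue
    visited.add(url)                      # ...and add it to the visited set
    resp = requests.get(url, timeout=10)  # Step 3: fetch the page's content
    print(url, len(resp.text))            # scrape/store the data you care about here
```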
Search engines run crawlers of their own, whose purpose is to index all the pages so that they can appear in search results; such crawlers are written in C++ and make use of internal libraries to keep them efficient.
Short answer: Python! If you're scraping simple websites with simple HTTP requests, Python is your best bet. Libraries such as requests or HTTPX make it very easy to scrape websites that don't require JavaScript to work correctly. Python offers a lot of simple-to-use HTTP clients.
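To illustrate, here's a sketch that fetches one page with requests and pulls every number out of it with a regular expression - the URL is a placeholder, and for real pages you'd likely want an HTML parser like Beautiful Soup instead of a regex:

```python
import re
import requests

# Placeholder URL - substitute one of the pages you actually need.
url = "https://example.com/prices"

resp = requests.get(url, timeout=10)
resp.raise_for_status()  # fail loudly on 4xx/5xx rather than scraping an error page

# Find every integer or decimal number in the raw HTML.
numbers = re.findall(r"\d+(?:\.\d+)?", resp.text)
print(numbers)
```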
Python probably, or Perl.
Perl has the very nice LWP (Library for WWW in Perl); Python has urllib2.
Both are easy scripting languages available on most OSs.
I've done a crawler in Perl quite a few times; it's an evening of work.
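On the Python side, note that urllib2 was the Python 2 module; in Python 3 the same functionality lives in urllib.request. A stdlib-only fetch of one page, with a hypothetical URL, looks roughly like this:

```python
from urllib.request import urlopen

# urllib2's Python 3 successor; no third-party packages needed.
with urlopen("https://example.com/data") as resp:
    html = resp.read().decode("utf-8", errors="replace")

print(html[:200])  # first 200 characters, just to confirm the fetch worked
```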
And no, they can't really protect themselves from crawlers, except by using a CAPTCHA of some sort - everything else is easier to crack than to set up.
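One thing you can check ahead of time, per the question's edit, is the site's robots.txt, which states whether the site asks crawlers to stay away - it's advisory only and says nothing about CAPTCHAs, but it's a start. A sketch using Python's standard urllib.robotparser, with hypothetical URLs:

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://example.com/robots.txt")
rp.read()  # fetch and parse the robots.txt file

# Would a generic crawler ("*") be allowed to fetch this page?
print(rp.can_fetch("*", "https://example.com/data"))
```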
There was a point about Java: Java is fine. It's more verbose and requires some development environment setup, so you wouldn't do it in one evening - probably a week.
For a small task like the one the question author described, that might be overkill.
On the other hand, there are very useful libraries like lint, tagsoup (DOM traversal for random HTML out there), and lucene (full-text indexing and search), so you might want Java for more serious projects.
In this case, I'd recommend the Apache commons-httpclient library for web crawling (or nutch if you're crazy :).
Also: there are off-the-shelf products that monitor changes on specified websites and present them in useful ways, so you might just grab one of those.
The language you are most comfortable with is more than likely the best language to use.
"I have very little programming experience"
You might find that a web crawler is a bit of a baptism of fire, and that you need to build a few other, more trivial applications first to become familiar with your chosen language (and framework, if applicable).
Good luck!