Here are the basic steps to build a crawler:
Step 1: Add one or several URLs to be visited.
Step 2: Pop a link from the URLs to be visited and add it to the Visited URLs list.
Step 3: Fetch the page's content and scrape the data you're interested in with the ScrapingBot API.
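For step 3, here is a minimal sketch of fetching a page's content with Java's built-in java.net.http client, used here as a stand-in for the ScrapingBot API; the URL is a placeholder and the actual scraping would operate on the returned body.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class FetchPage {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        // Placeholder URL popped from the "URLs to be visited" list (step 2).
        String url = "https://example.org/";
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        // Step 3: fetch the page's content; scraping would operate on response.body().
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + ", " + response.body().length() + " chars");
    }
}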
A crawler is a program that visits Web sites and reads their pages and other information in order to create entries for a search engine index.
You'll be reinventing the wheel, to be sure. But here are the basics: a list of unvisited URLs (seed this with one or more starting pages), a list of visited URLs (so you don't go around in circles), and a set of rules for which URLs you're interested in (so you don't try to index the whole Internet).
Put these in persistent storage, so you can stop and start the crawler without losing state.
Algorithm is:
while (list of unvisited URLs is not empty) {
    take URL from list
    remove it from the unvisited list and add it to the visited list
    fetch content
    record whatever it is you want to about the content
    if content is HTML {
        parse out URLs from links
        foreach URL {
            if it matches your rules
                and it's not already in either the visited or unvisited list
                    add it to the unvisited list
        }
    }
}
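Here is a minimal single-threaded Java sketch of that pseudocode, assuming in-memory (non-persistent) state, a crude regex link extractor, and a placeholder seed URL and matchesRules filter; a real crawler would use an HTML parser and persistent storage.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SimpleCrawler {
    // Very rough href extractor; a real crawler would use an HTML parser instead.
    private static final Pattern LINK =
            Pattern.compile("href=[\"'](https?://[^\"'#]+)[\"']", Pattern.CASE_INSENSITIVE);

    public static void main(String[] args) throws Exception {
        Deque<String> unvisited = new ArrayDeque<>();
        Set<String> visited = new HashSet<>();
        unvisited.add("https://example.org/");             // seed URL (placeholder)

        HttpClient client = HttpClient.newHttpClient();

        while (!unvisited.isEmpty()) {
            String url = unvisited.poll();                 // take URL from the unvisited list
            visited.add(url);                              // add it to the visited list

            HttpResponse<String> response = client.send(
                    HttpRequest.newBuilder(URI.create(url)).GET().build(),
                    HttpResponse.BodyHandlers.ofString()); // fetch content

            // "record whatever it is you want to" - here we just print the size
            System.out.println(url + " -> " + response.body().length() + " chars");

            String contentType = response.headers().firstValue("Content-Type").orElse("");
            if (contentType.contains("text/html")) {       // if content is HTML
                Matcher m = LINK.matcher(response.body()); // parse out URLs from links
                while (m.find()) {
                    String link = m.group(1);
                    if (matchesRules(link)
                            && !visited.contains(link)
                            && !unvisited.contains(link)) {
                        unvisited.add(link);               // add it to the unvisited list
                    }
                }
            }
        }
    }

    // Placeholder for "your rules"; here we simply stay on one host.
    private static boolean matchesRules(String url) {
        return url.startsWith("https://example.org/");
    }
}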
The complicated part of a crawler is scaling it to a huge number of websites/requests. In that situation you will have to deal with issues like:
The impossibility of keeping all the info in one database.
Not enough RAM to deal with huge indexes.
Multithreaded performance and concurrency.
Crawler traps (infinite loops created by changing URLs, calendars, session IDs...) and duplicated content.
Crawling from more than one computer.
Malformed HTML.
Constant HTTP errors from servers.
Databases without compression, which make your space requirements about 8x bigger.
Recrawl routines and priorities.
Use requests with compression (Deflate/gzip); this is good for any kind of crawler.
And some important things:
Respect robots.txt.
And add a crawl delay between requests so you don't suffocate web servers (see the sketch after this list).
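A minimal sketch of a polite, compressed fetch in Java: it asks for a gzip response (java.net.http does not decompress automatically, so the body is unwrapped by hand) and sleeps for a fixed delay after each request. The delay value is a placeholder, and robots.txt handling is left out.

import java.io.InputStream;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;

public class PoliteFetcher {
    private static final long CRAWL_DELAY_MS = 2000; // placeholder delay between requests

    private final HttpClient client = HttpClient.newHttpClient();

    // Fetch one URL, asking for a gzip-compressed response and decompressing it ourselves.
    public String fetch(String url) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .header("Accept-Encoding", "gzip")
                .GET()
                .build();
        HttpResponse<InputStream> response =
                client.send(request, HttpResponse.BodyHandlers.ofInputStream());

        String encoding = response.headers().firstValue("Content-Encoding").orElse("");
        try (InputStream in = encoding.contains("gzip")
                ? new GZIPInputStream(response.body())
                : response.body()) {
            String body = new String(in.readAllBytes(), StandardCharsets.UTF_8);
            Thread.sleep(CRAWL_DELAY_MS); // crawl delay so we don't suffocate the server
            return body;
        }
    }
}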
Multithreaded Web Crawler
If you want to crawl a large website then you should write a multi-threaded crawler. Connecting, fetching, and writing the crawled information to files/a database are the three steps of crawling, but if you use a single thread then your CPU and network utilization will be poor.
A multi-threaded web crawler needs two data structures: linksVisited (this should be implemented as a hash map or trie) and linksToBeVisited (this is a queue).
The web crawler uses BFS to traverse the World Wide Web.
Algorithm of a basic web crawler:
1. Add one or more seed URLs to linksToBeVisited.
2. Pop a URL from linksToBeVisited and add it to linksVisited.
3. Fetch the page.
4. Parse the page and add every link that matches your rules and has not been seen yet to linksToBeVisited.
5. Save whatever information you need from the page.
6. Repeat steps 2 to 5 until linksToBeVisited is empty.
Here is a code snippet on how to synchronize the threads....
import java.util.HashSet;
import java.util.LinkedList;
import java.util.List;
import java.util.Set;

// The surrounding class and fields are added to make the snippet self-contained;
// add() and next() are synchronized so multiple crawler threads can share the lists.
public class LinkStore {

    private final Set<String> linksVisited = new HashSet<>();        // links already crawled
    private final List<String> linksToBeVisited = new LinkedList<>(); // BFS queue

    public void add(String site) {
        synchronized (this) {
            // Only enqueue links that are neither visited nor already queued,
            // otherwise the same page could be crawled twice.
            if (!linksVisited.contains(site) && !linksToBeVisited.contains(site)) {
                linksToBeVisited.add(site);
            }
        }
    }

    public String next() {
        if (linksToBeVisited.size() == 0) {
            return null;
        }
        synchronized (this) {
            // Need to check again: another thread may have emptied the queue
            // between the unsynchronized check above and acquiring the lock.
            if (linksToBeVisited.size() > 0) {
                String s = linksToBeVisited.get(0);
                linksToBeVisited.remove(0);
                linksVisited.add(s);
                return s;
            }
            return null;
        }
    }
}
Crawlers are simple in concept.
You get a root page via an HTTP GET, parse it to find URLs, and put them on a queue unless they've been parsed already (so you need a global record of pages you have already parsed).
You can use the Content-Type header to find out what the type of content is, and limit your crawler to parsing only the HTML types.
You can strip out the HTML tags to get the plain text, which you can do text analysis on (to get tags, etc., the meat of the page). You could even do that on the alt/title tags of images if you got that advanced.
And in the background you can have a pool of threads eating URLs from the queue and doing the same. You want to limit the number of threads, of course.
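A minimal sketch of such a thread pool, assuming a fixed pool size, a thread-safe queue and visited set, and a Content-Type check before parsing; link extraction is stubbed out and would be done with a real HTML parser.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.List;
import java.util.Set;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class CrawlerPool {
    private static final int THREADS = 4;                           // limit the number of threads

    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
    private final Set<String> seen = ConcurrentHashMap.newKeySet(); // global record of parsed pages
    private final HttpClient client = HttpClient.newHttpClient();

    public void crawl(String seed) throws InterruptedException {
        seen.add(seed);
        queue.add(seed);
        ExecutorService pool = Executors.newFixedThreadPool(THREADS);
        for (int i = 0; i < THREADS; i++) {
            pool.submit(this::worker);
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.MINUTES);
    }

    // Each worker eats URLs from the queue until the queue stays empty for a while.
    private void worker() {
        try {
            String url;
            while ((url = queue.poll(5, TimeUnit.SECONDS)) != null) {
                HttpResponse<String> response = client.send(
                        HttpRequest.newBuilder(URI.create(url)).GET().build(),
                        HttpResponse.BodyHandlers.ofString());
                String type = response.headers().firstValue("Content-Type").orElse("");
                if (type.contains("text/html")) {                   // only parse HTML types
                    for (String link : extractLinks(response.body())) {
                        if (seen.add(link)) {                        // atomic: true only if new
                            queue.add(link);
                        }
                    }
                }
            }
        } catch (Exception e) {
            e.printStackTrace();                                     // a real crawler would log and retry
        }
    }

    // Stub; a real implementation would use an HTML parser such as jsoup.
    private List<String> extractLinks(String html) {
        return List.of();
    }
}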
If your NPO's sites are relatively big or complex (having dynamic pages that'll effectively create a 'black hole', like a calendar with a 'next day' link), you'd be better off using a real web crawler, like Heritrix.
If the sites total only a few pages, you can get away with just using curl or wget or your own. Just remember that if they start to get big, or you start making your script more complex, to use a real crawler, or at least look at its source to see what it is doing and why.
There are more issues to watch for than the ones covered above.