
What are the key considerations when creating a web crawler?

Tags:

web-crawler

I just started thinking about creating/customizing a web crawler today, and know very little about web crawler/robot etiquette. A majority of the writings on etiquette I've found seem old and awkward, so I'd like to get some current (and practical) insights from the web developer community.

I want to use a crawler to walk over "the web" for a super simple purpose - "does the markup of site XYZ meet condition ABC?".

This raises a lot of questions for me, but I think the two main questions I need to get out of the way first are:

  • It feels a little "iffy" from the get go -- is this sort of thing acceptable?
  • What specific considerations should the crawler take to not upset people?
asked Aug 28 '08 by Ian Robinson



4 Answers

Obey robots.txt (and don't be too aggressive, as has already been said).

You might want to think about your user-agent string - it's a good place to be up-front about what you're doing and how you can be contacted.
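A minimal sketch of how those two points might look in Python, using the standard library's robots.txt parser; the crawler name and contact URL below are placeholders, not real values:

```python
import urllib.parse
import urllib.request
import urllib.robotparser

# Placeholder identity: a short crawler name/version plus a URL people can
# visit (or an address they can write to) if your bot causes them trouble.
USER_AGENT = "MarkupChecker/0.1 (+https://example.com/crawler-info)"

def allowed_by_robots(url: str) -> bool:
    """Check the target site's robots.txt before fetching the page."""
    parts = urllib.parse.urlsplit(url)
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    rp.read()
    return rp.can_fetch(USER_AGENT, url)

def fetch(url: str) -> bytes:
    """Fetch a page, identifying the crawler via the User-Agent header."""
    req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.read()

target = "https://example.com/somepage.html"
if allowed_by_robots(target):
    html = fetch(target)
```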

answered by Will Dean


Besides WillDean's and Einar's good answers, I would really recommend you take the time to read about the meaning of the HTTP response codes and what your crawler should do when it encounters each one, since it will make a big difference to your performance, and to whether or not you get banned from some sites.
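As a rough illustration of the idea (not a complete policy), a fetch routine might branch on a handful of common codes; this sketch assumes the third-party requests library, and the back-off value is arbitrary:

```python
import time
import requests

def fetch_with_status_handling(url: str, user_agent: str):
    """Illustrative handling of a few common HTTP response codes."""
    resp = requests.get(url, headers={"User-Agent": user_agent},
                        timeout=10, allow_redirects=True)

    if resp.history:
        # One or more redirects were followed; remember the final URL so the
        # crawler stops hitting the old one.
        url = resp.url

    if resp.status_code == 200:
        return resp.text          # success: hand the markup to your checker
    if resp.status_code in (404, 410):
        return None               # gone: drop the URL rather than retrying forever
    if resp.status_code in (429, 503):
        # Rate-limited or overloaded: honour Retry-After and back off.
        # (Retry-After may also be an HTTP date; seconds are assumed here for brevity.)
        time.sleep(int(resp.headers.get("Retry-After", 60)))
        return None
    return None                   # anything else: log it and move on
```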

Some useful links:

HTTP/1.1: Status Code Definitions

Aggregator client HTTP tests

Wikipedia

answered by Ricardo Reyes


Please be sure to include a URL in your user-agent string that explains who/what/why your robot is crawling.
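For instance, a self-describing user-agent string along these lines (the name, URL, and address are made up for illustration) tells site owners exactly who to contact:

```python
# Hypothetical example of a User-Agent header that identifies the bot and its owner.
headers = {
    "User-Agent": "MarkupChecker/0.1 (+https://example.com/about-this-bot; crawler@example.com)"
}
```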

answered by ceejayoz


All good points already made here. You will also have to deal with dynamically generated Java and JavaScript links, parameters and session IDs, escaping single and double quotes, failed attempts at relative links (using ../../ to go past the root directory), case sensitivity, frames, redirects, cookies....
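Many of those link problems boil down to normalizing URLs before queueing them. A small sketch of one approach, with an assumed (and deliberately short) list of session/tracking parameters to strip:

```python
from urllib.parse import urljoin, urlsplit, urlunsplit, parse_qsl, urlencode

# Query parameters that are often just session or tracking noise -- an assumed
# list; adjust it for the sites you actually crawl.
IGNORED_PARAMS = {"phpsessid", "jsessionid", "sid", "utm_source", "utm_medium"}

def normalize_link(base_url: str, href: str) -> str:
    """Resolve a (possibly relative) link and strip fragments and session IDs."""
    absolute = urljoin(base_url, href)          # handles ../../ and friends
    parts = urlsplit(absolute)
    query = [(k, v) for k, v in parse_qsl(parts.query)
             if k.lower() not in IGNORED_PARAMS]
    return urlunsplit((
        parts.scheme.lower(),                   # scheme and host are case-insensitive
        parts.netloc.lower(),
        parts.path,                             # path case matters on most servers
        urlencode(query),
        "",                                     # drop the #fragment entirely
    ))

# e.g. normalize_link("https://Example.com/a/b/", "../c?PHPSESSID=abc#top")
#      -> "https://example.com/a/c"
```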

I could go on for days, and kinda have. I have a Robots Checklist that covers most of this, and I'm happy to answer what I can.

You should also think about using open-source robot crawler code, because it gives you a huge leg up on all these issues. I have a page on that as well: open source robot code. Hope that helps!

answered by user9569