I just started thinking about creating/customizing a web crawler today, and know very little about web crawler/robot etiquette. A majority of the writings on etiquette I've found seem old and awkward, so I'd like to get some current (and practical) insights from the web developer community.
I want to use a crawler to walk over "the web" for a super simple purpose - "does the markup of site XYZ meet condition ABC?".
This raises a lot of questions for me, but I think the two main questions I need to get out of the way first are:
Architecture: speed and efficiency are two basic requirements for any crawler before it is let loose on the internet, which is where the architectural design of the crawler program (or bot) comes into the picture. The crawl system should make efficient use of various system resources, including processor, storage, and network bandwidth.
Quality: given that a significant fraction of all web pages are of poor utility for serving user query needs, the crawler should be biased towards fetching "useful" pages first.
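To make the "useful pages first" idea concrete, here is a minimal sketch of a crawl frontier kept as a priority queue; the usefulness score is assumed to come from some scoring function of your own and is just a placeholder here:

```python
import heapq

class CrawlFrontier:
    """Priority-queue frontier: higher estimated usefulness is fetched sooner."""

    def __init__(self):
        self._heap = []
        self._seen = set()
        self._counter = 0  # tie-breaker so heapq never has to compare URLs

    def add(self, url, usefulness):
        """usefulness in [0, 1]; higher means 'fetch earlier'."""
        if url in self._seen:
            return
        self._seen.add(url)
        heapq.heappush(self._heap, (1.0 - usefulness, self._counter, url))
        self._counter += 1

    def next_url(self):
        return heapq.heappop(self._heap)[2] if self._heap else None

# Hypothetical usage: the scores would come from your own utility estimate.
frontier = CrawlFrontier()
frontier.add("https://example.com/", usefulness=0.9)
frontier.add("https://example.com/terms", usefulness=0.1)
print(frontier.next_url())  # the higher-scored page comes out first
```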
HTTrack is an open-source web crawler that lets users download websites from the internet to a local system. It is one of the better-known spidering tools: it crawls a site, downloads its pages, and reproduces the site's structure locally.
Obey robots.txt (and don't be too aggressive, as has already been said).
You might want to think about your user-agent string - it's a good place to be up-front about what you're doing and how you can be contacted.
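A minimal sketch of both of these points, using only Python's standard library; the bot name and contact URL are made-up placeholders:

```python
import urllib.robotparser
import urllib.request
from urllib.parse import urlparse

# Hypothetical bot name and contact URL; replace with your own details.
USER_AGENT = "MarkupCheckBot/0.1 (+https://example.com/about-this-crawler)"

def allowed_by_robots(url):
    """Ask the site's robots.txt whether we may fetch this URL."""
    parts = urlparse(url)
    robots = urllib.robotparser.RobotFileParser()
    robots.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    robots.read()
    return robots.can_fetch(USER_AGENT, url)

def polite_fetch(url):
    """Fetch the URL only if robots.txt allows it, identifying ourselves honestly."""
    if not allowed_by_robots(url):
        return None
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.read()
```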
Besides WillDean's and Einar's good answers, I would really recommend you take some time to read about the meaning of the HTTP response codes and what your crawler should do when it encounters each one, since it will make a big difference to your performance and to whether or not you get banned from some sites.
Some useful links:
HTTP/1.1: Status Code Definitions
Aggregator client HTTP tests
Wikipedia
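To illustrate what reacting to the response codes might look like in practice, here is one possible fetch routine; the specific retry and back-off choices are illustrative, not prescriptive:

```python
import time
import urllib.error
import urllib.request

def fetch_with_status_handling(url, retries=3):
    """One possible policy for reacting to HTTP status codes while crawling."""
    for attempt in range(retries):
        try:
            with urllib.request.urlopen(url, timeout=10) as response:
                return response.read()          # 2xx; 3xx redirects are followed for us
        except urllib.error.HTTPError as err:
            if err.code in (404, 410):          # gone for good: drop the URL
                return None
            if err.code in (429, 503):          # slow down: honour Retry-After if present
                retry_after = err.headers.get("Retry-After", "")
                delay = int(retry_after) if retry_after.isdigit() else 2 ** attempt
                time.sleep(delay)
                continue
            return None                         # other 4xx/5xx: give up on this URL
        except urllib.error.URLError:
            time.sleep(2 ** attempt)            # network trouble: retry with back-off
    return None
```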
Please be sure to include a URL in your user-agent string that explains who/what/why your robot is crawling.
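For example, a common convention is to put the URL, prefixed with "+", in the comment part of the user-agent; the name and address below are hypothetical:

```python
# Hypothetical identifier: bot name, version, and a page explaining the crawl.
USER_AGENT = "MarkupCheckBot/0.1 (+https://example.com/about-this-crawler)"
```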
The points made here are all good ones. You will also have to deal with dynamically generated Java and JavaScript links, parameters and session IDs, escaping of single and double quotes, failed attempts at relative links (using ../../ to go past the root directory), case sensitivity, frames, redirects, cookies....
I could go on for days, and kinda have. I have a Robots Checklist that covers most of this, and I'm happy to answer what I can.
You should also think about using open-source robot crawler code, because it gives you a huge leg up on all these issues. I have a page on that as well: open source robot code. Hope that helps!
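To make a couple of the issues mentioned above concrete (relative links, case sensitivity, session IDs), here is a sketch of link normalization using Python's urllib.parse; the list of session parameters is a made-up example:

```python
from urllib.parse import urljoin, urlparse, urlunparse, parse_qsl, urlencode

# Hypothetical set of query parameters to treat as session IDs.
SESSION_PARAMS = {"sessionid", "sid", "phpsessid", "jsessionid"}

def normalize_link(base_url, href):
    """Resolve a raw href against the page it came from and canonicalize it."""
    absolute = urljoin(base_url, href.strip())       # handles relative links and ../ segments
    parts = urlparse(absolute)
    if parts.scheme not in ("http", "https"):        # skip javascript:, mailto:, etc.
        return None
    query = [(k, v) for k, v in parse_qsl(parts.query)
             if k.lower() not in SESSION_PARAMS]     # drop session-ID parameters
    return urlunparse((
        parts.scheme.lower(),
        parts.netloc.lower(),                        # host names are case-insensitive
        parts.path or "/",                           # paths, however, are case-sensitive: keep as-is
        parts.params,
        urlencode(query),
        "",                                          # drop fragments; they never reach the server
    ))

# Example: a relative link carrying a session ID.
print(normalize_link("http://Example.COM/a/b/page.html",
                     "../c/page2.html?sessionid=abc123&q=1"))
# -> http://example.com/a/c/page2.html?q=1
```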