I just started thinking about creating/customizing a web crawler today, and know very little about web crawler/robot etiquette. A majority of the writings on etiquette I've found seem old and awkward, so I'd like to get some current (and practical) insights from the web developer community.
I want to use a crawler to walk over "the web" for a super simple purpose - "does the markup of site XYZ meet condition ABC?".
This raises a lot of questions for me, but I think the two main questions I need to get out of the way first are:
Architecture: speed and efficiency are two basic requirements for any crawler before it is let loose on the internet, which is where the architectural design of the crawler program (or bot) comes into the picture. The crawl system should make efficient use of various system resources, including processor, storage, and network bandwidth.
Quality: given that a significant fraction of all web pages are of poor utility for serving user query needs, the crawler should be biased towards fetching "useful" pages first.
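To make the "useful pages first" idea concrete, here is a minimal sketch of a crawl frontier kept as a priority queue; the usefulness score is assumed to come from some scoring function of your own and is just a placeholder here:

```python
import heapq

class CrawlFrontier:
    """Priority-queue frontier: higher estimated usefulness is fetched sooner."""

    def __init__(self):
        self._heap = []
        self._seen = set()
        self._counter = 0  # tie-breaker so heapq never has to compare URLs

    def add(self, url, usefulness):
        """usefulness in [0, 1]; higher means 'fetch earlier'."""
        if url in self._seen:
            return
        self._seen.add(url)
        heapq.heappush(self._heap, (1.0 - usefulness, self._counter, url))
        self._counter += 1

    def next_url(self):
        return heapq.heappop(self._heap)[2] if self._heap else None

# Hypothetical usage: the scores would come from your own utility estimate.
frontier = CrawlFrontier()
frontier.add("https://example.com/", usefulness=0.9)
frontier.add("https://example.com/terms", usefulness=0.1)
print(frontier.next_url())  # the higher-scored page comes out first
```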
HTTrack is an open-source web crawler that lets users download websites from the internet to a local system. It is one of the better-known spidering tools: it crawls a site, downloads its pages, and reproduces the site's structure locally.
Obey robots.txt (and don't be too aggressive, as has already been said).
You might want to think about your user-agent string - it's a good place to be up-front about what you're doing and how you can be contacted.
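A minimal sketch of both of these points, using only Python's standard library; the bot name and contact URL are made-up placeholders:

```python
import urllib.robotparser
import urllib.request
from urllib.parse import urlparse

# Hypothetical bot name and contact URL; replace with your own details.
USER_AGENT = "MarkupCheckBot/0.1 (+https://example.com/about-this-crawler)"

def allowed_by_robots(url):
    """Ask the site's robots.txt whether we may fetch this URL."""
    parts = urlparse(url)
    robots = urllib.robotparser.RobotFileParser()
    robots.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    robots.read()
    return robots.can_fetch(USER_AGENT, url)

def polite_fetch(url):
    """Fetch the URL only if robots.txt allows it, identifying ourselves honestly."""
    if not allowed_by_robots(url):
        return None
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.read()
```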
Besides WillDean's and Einar's good answers, I would really recommend you take some time to read about the meaning of the HTTP response codes and what your crawler should do when it encounters each one, since it will make a big difference to your performance and to whether or not you get banned from some sites.
Some useful links:
HTTP/1.1: Status Code Definitions
Aggregator client HTTP tests
Wikipedia
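To illustrate what reacting to the response codes might look like in practice, here is one possible fetch routine; the specific retry and back-off choices are illustrative, not prescriptive:

```python
import time
import urllib.error
import urllib.request

def fetch_with_status_handling(url, retries=3):
    """One possible policy for reacting to HTTP status codes while crawling."""
    for attempt in range(retries):
        try:
            with urllib.request.urlopen(url, timeout=10) as response:
                return response.read()          # 2xx; 3xx redirects are followed for us
        except urllib.error.HTTPError as err:
            if err.code in (404, 410):          # gone for good: drop the URL
                return None
            if err.code in (429, 503):          # slow down: honour Retry-After if present
                retry_after = err.headers.get("Retry-After", "")
                delay = int(retry_after) if retry_after.isdigit() else 2 ** attempt
                time.sleep(delay)
                continue
            return None                         # other 4xx/5xx: give up on this URL
        except urllib.error.URLError:
            time.sleep(2 ** attempt)            # network trouble: retry with back-off
    return None
```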
Please be sure to include a URL in your user-agent string that explains who/what/why your robot is crawling.
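For example, a common convention is to put the URL, prefixed with "+", in the comment part of the user-agent; the name and address below are hypothetical:

```python
# Hypothetical identifier: bot name, version, and a page explaining the crawl.
USER_AGENT = "MarkupCheckBot/0.1 (+https://example.com/about-this-crawler)"
```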
The points made here are all good ones. You will also have to deal with dynamically generated Java and JavaScript links, parameters and session IDs, escaping of single and double quotes, failed attempts at relative links (using ../../ to go past the root directory), case sensitivity, frames, redirects, cookies....
I could go on for days, and kinda have. I have a Robots Checklist that covers most of this, and I'm happy to answer what I can.
You should also think about using open-source robot crawler code, because it gives you a huge leg up on all these issues. I have a page on that as well: open source robot code. Hope that helps!
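To make a couple of the issues mentioned above concrete (relative links, case sensitivity, session IDs), here is a sketch of link normalization using Python's urllib.parse; the list of session parameters is a made-up example:

```python
from urllib.parse import urljoin, urlparse, urlunparse, parse_qsl, urlencode

# Hypothetical set of query parameters to treat as session IDs.
SESSION_PARAMS = {"sessionid", "sid", "phpsessid", "jsessionid"}

def normalize_link(base_url, href):
    """Resolve a raw href against the page it came from and canonicalize it."""
    absolute = urljoin(base_url, href.strip())       # handles relative links and ../ segments
    parts = urlparse(absolute)
    if parts.scheme not in ("http", "https"):        # skip javascript:, mailto:, etc.
        return None
    query = [(k, v) for k, v in parse_qsl(parts.query)
             if k.lower() not in SESSION_PARAMS]     # drop session-ID parameters
    return urlunparse((
        parts.scheme.lower(),
        parts.netloc.lower(),                        # host names are case-insensitive
        parts.path or "/",                           # paths, however, are case-sensitive: keep as-is
        parts.params,
        urlencode(query),
        "",                                          # drop fragments; they never reach the server
    ))

# Example: a relative link carrying a session ID.
print(normalize_link("http://Example.COM/a/b/page.html",
                     "../c/page2.html?sessionid=abc123&q=1"))
# -> http://example.com/a/c/page2.html?q=1
```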