 

What techniques can be used to detect so-called "black holes" (spider traps) when creating a web crawler?

Tags:

web-crawler

When creating a web crawler, you have to design some kind of system that gathers links and adds them to a queue. Some, if not most, of these links will be dynamic: they appear to be different, but add no value because they are created specifically to fool crawlers.
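To make it concrete, here is roughly the kind of queue-plus-visited-set setup I mean (Python just for illustration; fetch and extract_links are placeholders for whatever HTTP and HTML-parsing layer you use):

```python
from collections import deque

def crawl(start_url, fetch, extract_links):
    """Naive crawler: a FIFO frontier plus a visited set.
    fetch(url) -> html text; extract_links(html, base_url) -> iterable of URLs.
    Both are placeholders, not real library calls."""
    frontier = deque([start_url])
    visited = set()
    while frontier:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        html = fetch(url)
        for link in extract_links(html, url):
            if link not in visited:
                frontier.append(link)  # a trap keeps this queue growing forever
```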

An example:

We tell our crawler to crawl the domain evil.com by entering an initial lookup URL.

Let's assume we let it crawl the front page first, evil.com/index

The returned HTML will contain several "unique" links:

  • evil.com/somePageOne
  • evil.com/somePageTwo
  • evil.com/somePageThree

The crawler will add these to the buffer of uncrawled URLs.

When somePageOne is being crawled, the crawler receives more URLs:

  • evil.com/someSubPageOne
  • evil.com/someSubPageTwo

These appear to be unique, and so they are: the returned content differs from previous pages and the URLs are new to the crawler. However, this is only because the developer has built a "loop trap" or "black hole".

The crawler will add these new sub-pages, and each sub-page will link to yet another sub-page, which will also be added. This process can go on indefinitely. The content of each page is unique but totally useless (randomly generated text, or text pulled from a random source). Our crawler will keep finding new pages that we are not actually interested in.

These loop traps are very difficult to detect, and if your crawler has nothing in place to prevent them, it will get stuck on such a domain indefinitely.

My question is: what techniques can be used to detect so-called black holes?

One of the most common answers I have heard is to introduce a limit on the number of pages to be crawled. However, I cannot see how this can be a reliable technique when you do not know what kind of site is being crawled. A legitimate site like Wikipedia can have hundreds of thousands of pages, so such a limit could produce a false positive for these kinds of sites.
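For clarity, this is roughly what that suggested limit looks like (my own sketch; the thresholds are made up, and as said above they would also cut off large legitimate sites):

```python
from urllib.parse import urlparse

MAX_PAGES_PER_DOMAIN = 10_000   # illustrative threshold, not a recommendation
MAX_DEPTH = 20                  # illustrative link-depth cap

pages_per_domain = {}

def should_crawl(url, depth):
    """Per-domain page budget plus a depth cap: the naive defence
    discussed above. Checking a URL here also counts it against the
    domain's budget. It stops traps, but may also cut off large
    legitimate sites such as Wikipedia (the false-positive problem)."""
    domain = urlparse(url).netloc
    if depth > MAX_DEPTH:
        return False
    if pages_per_domain.get(domain, 0) >= MAX_PAGES_PER_DOMAIN:
        return False
    pages_per_domain[domain] = pages_per_domain.get(domain, 0) + 1
    return True
```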

Asked Dec 22 '10 by Tom


People also ask

How do you find a crawler trap?

Googlebot is able to detect most spider traps. Once a spider trap is detected, Google will stop crawling the trap and lower the crawl frequency of those pages. However, detecting a crawl trap may take Google some time, and after detection crawl budget is still wasted on the spider trap, only less than before.

What is spider trap in web mining?

A spider trap (or crawler trap) is a set of web pages that may intentionally or unintentionally be used to cause a web crawler or search bot to make an infinite number of requests or cause a poorly constructed crawler to crash.

What is a spider trap called?

Crawler traps—also known as "spider traps"—can seriously hurt your SEO performance by wasting your crawl budget and generating duplicate content. The term "crawler traps" refers to a structural issue within a website that results in crawlers finding a virtually infinite number of irrelevant URLs.

How do web spiders collect information?

They crawl the webpages at those URLs first. As they crawl those webpages, they will find hyperlinks to other URLs, and they add those to the list of pages to crawl next. Given the vast number of webpages on the Internet that could be indexed for search, this process could go on almost indefinitely.


1 Answer

Well, you've asked a very challenging question. There are many issues:

First, do you think someone would really do something like that to prevent web spidering? A web spider could act as a DoS attack if it got stuck in such a structure.

Secondly, if the page is made for users, how would they react to a large number of senseless links leading to randomly generated 'trash sites'? These links would have to be invisible to the user, either very few in number or hidden somehow, so you should check whether links have display: none, a 1 px font, etc. A rough check is sketched below.
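Something like this sketch (BeautifulSoup here is just one option; the inline-style heuristics are only illustrative and will not catch links hidden via external CSS or JavaScript):

```python
import re
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

def suspicious_links(html):
    """Return links whose inline style suggests they are hidden from users.
    Only inline styles are inspected; hiding via external CSS or scripts
    would require a rendering engine to detect."""
    soup = BeautifulSoup(html, "html.parser")
    flagged = []
    for a in soup.find_all("a", href=True):
        style = a.get("style", "").lower()
        compact = style.replace(" ", "")
        hidden = (
            "display:none" in compact
            or "visibility:hidden" in compact
            or re.search(r"font-size\s*:\s*[01]px", style)
        )
        if hidden or not a.get_text(strip=True):
            flagged.append(a["href"])
    return flagged
```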

Third, how does Google behave? Well, Google does not index everything it can. It adds links to a queue, but does not follow them immediately. It does not like to follow deeply nested links that are not referenced from previously indexed pages. As a result it does not index everything, but the pages users are most likely to visit do eventually get indexed. Otherwise pages like the ones you describe would be used extremely often by SEO spammers ;)

I would build a priority queue. Each link to a URL adds 1 point of priority (more when it comes from the main page). Pages with priority 1 go to the end of the list. I would limit the number of visited pages, so that in the worst case I would still visit the most important pages. I would also be suspicious of pages that contain too many links with too little content. In short, simulate Google's behaviour as much as needed; a sketch of that priority queue follows below.
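A minimal sketch of that idea (my own interpretation, not production code: the scoring weights, the page budget, and the "too many links, too little content" threshold are all example values):

```python
import heapq

class PriorityFrontier:
    """Crawl frontier where URLs referenced more often (or from the main
    page) are crawled first, with a hard cap on visited pages so the
    worst case still covers the most important URLs."""

    def __init__(self, page_budget=5000):
        self.scores = {}      # url -> accumulated priority points
        self.heap = []        # (-score, url); stale entries skipped lazily
        self.visited = set()
        self.page_budget = page_budget

    def add_link(self, url, from_main_page=False):
        if url in self.visited:
            return
        self.scores[url] = self.scores.get(url, 0) + (3 if from_main_page else 1)
        heapq.heappush(self.heap, (-self.scores[url], url))

    def next_url(self):
        while self.heap and len(self.visited) < self.page_budget:
            score, url = heapq.heappop(self.heap)
            if url in self.visited or -score != self.scores[url]:
                continue  # already crawled, or a stale heap entry
            self.visited.add(url)
            return url
        return None

def looks_like_trash(text_length, link_count):
    """Heuristic: too many links with too little content."""
    return link_count > 100 and text_length / max(link_count, 1) < 20
```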

Answered Sep 29 '22 by Danubian Sailor