 

What techniques can be used to detect so-called "black holes" (spider traps) when creating a web crawler?

Tags:

web-crawler

When creating a web crawler, you have to design some kind of system that gathers links and adds them to a queue. Some, if not most, of these links will be dynamic: they appear to be different, but add no value because they are created specifically to fool crawlers.
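To make it concrete, here is roughly the kind of queue-plus-visited-set setup I mean (Python just for illustration; fetch and extract_links are placeholders for whatever HTTP and HTML-parsing layer you use):

```python
from collections import deque

def crawl(start_url, fetch, extract_links):
    """Naive crawler: a FIFO frontier plus a visited set.
    fetch(url) -> html text; extract_links(html, base_url) -> iterable of URLs.
    Both are placeholders, not real library calls."""
    frontier = deque([start_url])
    visited = set()
    while frontier:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        html = fetch(url)
        for link in extract_links(html, url):
            if link not in visited:
                frontier.append(link)  # a trap keeps this queue growing forever
```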

An example:

We tell our crawler to crawl the domain evil.com by entering an initial lookup URL.

Let's assume we let it crawl the front page first, evil.com/index

The returned HTML will contain several "unique" links:

  • evil.com/somePageOne
  • evil.com/somePageTwo
  • evil.com/somePageThree

The crawler will add these to the buffer of uncrawled URLs.

When somePageOne is being crawled, the crawler receives more URLs:

  • evil.com/someSubPageOne
  • evil.com/someSubPageTwo

These appear to be unique, and so they are: the returned content differs from previous pages and the URLs are new to the crawler. However, this is only because the developer has built a "loop trap" or "black hole".

The crawler will add these new sub-pages, and each sub-page will link to yet another sub-page, which will also be added. This process can go on indefinitely. The content of each page is unique but totally useless (randomly generated text, or text pulled from a random source). Our crawler will keep finding new pages that we are not actually interested in.

These loop traps are very difficult to detect, and if your crawler has nothing in place to prevent them, it will get stuck on such a domain indefinitely.

My question is: what techniques can be used to detect so-called black holes?

One of the most common answers I have heard is to introduce a limit on the number of pages to be crawled. However, I cannot see how this can be a reliable technique when you do not know what kind of site is being crawled. A legitimate site like Wikipedia can have hundreds of thousands of pages, so such a limit could produce a false positive for these kinds of sites.
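For clarity, this is roughly what that suggested limit looks like (my own sketch; the thresholds are made up, and as said above they would also cut off large legitimate sites):

```python
from urllib.parse import urlparse

MAX_PAGES_PER_DOMAIN = 10_000   # illustrative threshold, not a recommendation
MAX_DEPTH = 20                  # illustrative link-depth cap

pages_per_domain = {}

def should_crawl(url, depth):
    """Per-domain page budget plus a depth cap: the naive defence
    discussed above. Checking a URL here also counts it against the
    domain's budget. It stops traps, but may also cut off large
    legitimate sites such as Wikipedia (the false-positive problem)."""
    domain = urlparse(url).netloc
    if depth > MAX_DEPTH:
        return False
    if pages_per_domain.get(domain, 0) >= MAX_PAGES_PER_DOMAIN:
        return False
    pages_per_domain[domain] = pages_per_domain.get(domain, 0) + 1
    return True
```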

Asked Dec 22 '10 by Tom


People also ask

How do you find a crawler trap?

Googlebot is able to detect most spider traps. Once a spider trap is detected, Google will stop crawling the trap and lower the crawl frequency of those pages. However, detecting a crawl trap may take Google some time, and after detection crawl budget is still wasted on the spider trap, only less than before.

What is spider trap in web mining?

A spider trap (or crawler trap) is a set of web pages that may intentionally or unintentionally be used to cause a web crawler or search bot to make an infinite number of requests or cause a poorly constructed crawler to crash.

What is a spider trap called?

Crawler traps—also known as "spider traps"—can seriously hurt your SEO performance by wasting your crawl budget and generating duplicate content. The term "crawler traps" refers to a structural issue within a website that results in crawlers finding a virtually infinite number of irrelevant URLs.

How do web spiders collect information?

They crawl the webpages at those URLs first. As they crawl those webpages, they will find hyperlinks to other URLs, and they add those to the list of pages to crawl next. Given the vast number of webpages on the Internet that could be indexed for search, this process could go on almost indefinitely.


1 Answer

Well, you've asked a very challenging question. There are many issues:

First, do you think someone would really do something like that to prevent web spidering? A web spider could act as a DoS attack if it got stuck in such a structure.

Secondly, if the page is made for users, how would they react to a large number of senseless links leading to randomly generated 'trash sites'? These links would have to be invisible to the user, either very few in number or hidden somehow, so you should check whether links have display: none, a 1 px font, etc. A rough check is sketched below.
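Something like this sketch (BeautifulSoup here is just one option; the inline-style heuristics are only illustrative and will not catch links hidden via external CSS or JavaScript):

```python
import re
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

def suspicious_links(html):
    """Return links whose inline style suggests they are hidden from users.
    Only inline styles are inspected; hiding via external CSS or scripts
    would require a rendering engine to detect."""
    soup = BeautifulSoup(html, "html.parser")
    flagged = []
    for a in soup.find_all("a", href=True):
        style = a.get("style", "").lower()
        compact = style.replace(" ", "")
        hidden = (
            "display:none" in compact
            or "visibility:hidden" in compact
            or re.search(r"font-size\s*:\s*[01]px", style)
        )
        if hidden or not a.get_text(strip=True):
            flagged.append(a["href"])
    return flagged
```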

Third, how does Google behave? Well, Google does not index everything it can. It adds links to a queue, but does not follow them immediately. It does not like to follow deeply nested links that are not referenced from previously indexed pages. As a result it does not index everything, but the pages users are most likely to visit do eventually get indexed. Otherwise pages like the ones you describe would be used extremely often by SEO spammers ;)

I would build a priority queue. Each link to a URL adds 1 point of priority (more when it comes from the main page). Pages with priority 1 go to the end of the list. I would limit the number of visited pages, so that in the worst case I would still visit the most important pages. I would also be suspicious of pages that contain too many links with too little content. In short, simulate Google's behaviour as much as needed; a sketch of that priority queue follows below.
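A minimal sketch of that idea (my own interpretation, not production code: the scoring weights, the page budget, and the "too many links, too little content" threshold are all example values):

```python
import heapq

class PriorityFrontier:
    """Crawl frontier where URLs referenced more often (or from the main
    page) are crawled first, with a hard cap on visited pages so the
    worst case still covers the most important URLs."""

    def __init__(self, page_budget=5000):
        self.scores = {}      # url -> accumulated priority points
        self.heap = []        # (-score, url); stale entries skipped lazily
        self.visited = set()
        self.page_budget = page_budget

    def add_link(self, url, from_main_page=False):
        if url in self.visited:
            return
        self.scores[url] = self.scores.get(url, 0) + (3 if from_main_page else 1)
        heapq.heappush(self.heap, (-self.scores[url], url))

    def next_url(self):
        while self.heap and len(self.visited) < self.page_budget:
            score, url = heapq.heappop(self.heap)
            if url in self.visited or -score != self.scores[url]:
                continue  # already crawled, or a stale heap entry
            self.visited.add(url)
            return url
        return None

def looks_like_trash(text_length, link_count):
    """Heuristic: too many links with too little content."""
    return link_count > 100 and text_length / max(link_count, 1) < 20
```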

Answered Sep 29 '22 by Danubian Sailor