 

How do web spiders differ from Wget's spider?

The following passage in Wget's manual caught my eye:

wget --spider --force-html -i bookmarks.html

This feature needs much more work for Wget to get close to the functionality of real web spiders.

I found the following lines of code relevant to the spider option in Wget.

src/ftp.c
780:      /* If we're in spider mode, don't really retrieve anything.  The
784:      if (opt.spider)
889:  if (!(cmd & (DO_LIST | DO_RETR)) || (opt.spider && !(cmd & DO_LIST)))
1227:      if (!opt.spider)
1239:      if (!opt.spider)
1268:      else if (!opt.spider)
1827:          if (opt.htmlify && !opt.spider)

src/http.c
64:#include "spider.h"
2405:  /* Skip preliminary HEAD request if we're not in spider mode AND
2407:  if (!opt.spider
2428:      if (opt.spider && !got_head)
2456:      /* Default document type is empty.  However, if spider mode is
2570:           * spider mode.  */
2571:          else if (opt.spider)
2661:              if (opt.spider)

src/res.c
543:  int saved_sp_val = opt.spider;
548:  opt.spider       = false;
551:  opt.spider       = saved_sp_val;  

src/spider.c
1:/* Keep track of visited URLs in spider mode.
37:#include "spider.h"
49:spider_cleanup (void)

src/spider.h
1:/* Declarations for spider.c

src/recur.c
52:#include "spider.h"
279:      if (opt.spider)
366:              || opt.spider /* opt.recursive is implicitely true */
370:             (otherwise unneeded because of --spider or rejected by -R) 
375:                   (opt.spider ? "--spider" : 
378:                     (opt.delete_after || opt.spider
440:      if (opt.spider) 

src/options.h
62:  bool spider;           /* Is Wget in spider mode? */

src/init.c
238:  { "spider",           &opt.spider,            cmd_boolean },

src/main.c
56:#include "spider.h"
238:    { "spider", 0, OPT_BOOLEAN, "spider", -1 },
435:       --spider                  don't download anything.\n"),
1045:  if (opt.recursive && opt.spider)

I would like to see the differences in code, not abstractly. I love code examples.

How do web spiders differ from Wget's spider in code?

Léo Léopold Hertz 준영 asked Apr 17 '09 21:04



1 Answer

A real spider is a lot of work

Writing a spider for the whole WWW is quite a task --- you have to take care of many "little details" such as:

  • Each spider computer should receive data from a few thousand servers in parallel in order to make efficient use of the connection bandwidth (asynchronous socket I/O).
  • You need several computers that spider in parallel in order to cover the vast amount of information on the WWW (clustering; partitioning the work).
  • You need to be polite to the spidered web sites:
    • Respect the robots.txt files.
    • Don't fetch a lot of information too quickly: this overloads the servers.
    • Don't fetch files that you really don't need (e.g. iso disk images; tgz packages for software download...).
  • You have to deal with cookies/session ids: Many sites attach unique session ids to URLs to identify client sessions. Each time you arrive at the site, you get a new session id and a new virtual world of pages (with the same content). Because of such problems, early search engines ignored dynamic content. Modern search engines have learned what the problems are and how to deal with them.
  • You have to detect and ignore troublesome data: connections that provide a seemingly infinite amount of data or connections that are too slow to finish.
  • Besides following links, you may want to parse sitemaps to get URLs of pages.
  • You may want to evaluate which information is important for you and changes frequently to be refreshed more frequently than other pages. Note: A spider for the whole WWW receives a lot of data --- you pay for that bandwidth. You may want to use HTTP HEAD requests to guess whether a page has changed or not.
  • Besides receiving, you want to process the information and store it. Google builds indices that list for each word the pages that contain it. You may need separate storage computers and an infrastructure to connect them. Traditional relational databases don't keep up with the data volume and performance requirements of storing/indexing the whole WWW.
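The politeness rules in the list above (respect robots.txt, don't fetch too quickly, skip files you don't need) can be sketched in a few lines of Python. This is only an illustration under assumptions: the class name `PolitenessGate`, the one-second delay, and the suffix list are made up for the example, and the robots.txt text is assumed to have been downloaded by the caller already. Only `urllib.robotparser` and `urllib.parse` from the standard library are used.

```python
import time
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


class PolitenessGate:
    """Decides whether and when a URL may be fetched (hypothetical helper)."""

    def __init__(self, user_agent, min_delay=1.0, skip_suffixes=(".iso", ".tgz")):
        self.user_agent = user_agent
        self.min_delay = min_delay          # assumed: seconds between hits on one host
        self.skip_suffixes = skip_suffixes  # assumed: files this spider does not need
        self.rules = {}                     # host -> RobotFileParser
        self.next_allowed = {}              # host -> earliest time of the next fetch

    def load_robots(self, host, robots_txt):
        """Parse robots.txt text that the caller has already downloaded."""
        parser = RobotFileParser()
        parser.parse(robots_txt.splitlines())
        self.rules[host] = parser

    def allowed(self, url):
        parts = urlsplit(url)
        if parts.path.lower().endswith(self.skip_suffixes):
            return False
        parser = self.rules.get(parts.netloc)
        # No rules loaded for this host: treated as allowed in this sketch.
        return parser is None or parser.can_fetch(self.user_agent, url)

    def wait_time(self, url, now=None):
        """Seconds to wait before fetching url; also reserves that time slot."""
        host = urlsplit(url).netloc
        now = time.monotonic() if now is None else now
        start = max(now, self.next_allowed.get(host, now))
        self.next_allowed[host] = start + self.min_delay
        return start - now
```

A spider would call `allowed(url)` before queueing a URL and sleep (or reschedule) for `wait_time(url)` before connecting. A real crawler would probably also fetch robots.txt before the first request to a host instead of treating unknown hosts as allowed.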

This is a lot of work. But if your target is more modest than reading the whole WWW, you may skip some of the parts. If you just want to download a copy of a wiki, for example, you get down to the specs of wget.

Note: If you don't believe that it's so much work, you may want to read up on how Google re-invented most of the computing wheels (on top of the basic Linux kernel) to build good spiders. Even if you cut a lot of corners, it's a lot of work.

Let me add a few more technical remarks on three points:

Parallel connections / asynchronous socket communication

You could run several spider programs in parallel processes or threads. But you need about 5000-10000 parallel connections in order to make good use of your network connection, and that number of parallel processes/threads produces too much overhead.

A better solution is asynchronous input/output: process about 1000 parallel connections in a single thread by opening the sockets in non-blocking mode and using epoll or select to process just those connections that have received data. Since kernel 2.4, Linux has had excellent support for scalability, continuously improved in later versions (I also recommend that you study memory-mapped files).
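As a rough illustration of this single-threaded approach, here is a minimal Python sketch built on the standard `selectors` module, which picks epoll on Linux and falls back to other mechanisms elsewhere. The function name `read_all` is made up for the example, and the sockets are assumed to be connected already with the requests sent. Timeouts and size limits, which a real spider needs for slow or endless connections, are left out.

```python
import selectors
import socket


def read_all(socks, chunk_size=4096):
    """Read every socket to EOF in a single thread; returns {sock: bytes}."""
    sel = selectors.DefaultSelector()   # epoll on Linux, else kqueue/poll/select
    buffers = {}
    for sock in socks:
        sock.setblocking(False)         # recv() must never stall the loop
        buffers[sock] = bytearray()
        sel.register(sock, selectors.EVENT_READ)

    while sel.get_map():                # until every socket has been unregistered
        # Only sockets that actually have data (or EOF) are returned here.
        for key, _events in sel.select(timeout=1.0):
            sock = key.fileobj
            try:
                data = sock.recv(chunk_size)
            except BlockingIOError:
                continue                # nothing to read after all, retry later
            if data:
                buffers[sock] += data
            else:                       # empty read: the peer closed the connection
                sel.unregister(sock)
    sel.close()
    return {sock: bytes(buf) for sock, buf in buffers.items()}
```

The loop never touches idle connections, which is why one thread can serve many sockets. It can be tried without a network by passing in one end of several `socket.socketpair()` pairs.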

Note: Using asynchronous I/O helps much more than using a "fast language": it's better to run one epoll-driven process written in Perl that handles 1000 connections than to run 1000 processes written in C. If you do it right, you can saturate a 100Mb connection with processes written in Perl.

From the original answer: The down side of this approach is that you will have to implement the HTTP specification yourself in an asynchronous form (I am not aware of a re-usable library that does this). It's much easier to do this with the simpler HTTP/1.0 protocol than the modern HTTP/1.1 protocol. You probably would not benefit from the advantages of HTTP/1.1 for normal browsers anyhow, so this may be a good place to save some development costs.
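To give an idea of why HTTP/1.0 is the cheaper target, here is a hedged Python sketch of the two pieces you would write yourself: building a request and splitting a complete response. The function names and the `example-spider/0.1` user agent are invented for the illustration. The sketch assumes a plain HTTP/1.0 exchange without keep-alive, where the server closes the connection at the end of the body, so no chunked-transfer decoding is needed.

```python
def build_request(host, path="/", user_agent="example-spider/0.1"):
    """Return the bytes of a minimal HTTP/1.0 GET request."""
    lines = [
        f"GET {path} HTTP/1.0",
        f"Host: {host}",              # optional in 1.0, but virtual hosts rely on it
        f"User-Agent: {user_agent}",
        "",                           # the empty line ends the header section
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def parse_response(raw):
    """Split a complete response (read until close) into status, headers, body."""
    head, _sep, body = raw.partition(b"\r\n\r\n")
    status_line, *header_lines = head.decode("iso-8859-1").split("\r\n")
    _version, status, *_reason = status_line.split(" ", 2)
    headers = {}
    for line in header_lines:
        name, _colon, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()   # repeated headers overwrite
    return int(status), headers, body
```

In an asynchronous spider, `raw` would be the buffer accumulated for one connection by the time the server closes it. Redirects, malformed responses, and character sets are not handled here.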

Edit five years later: Today, there is a lot of free/open source technology available to help you with this work. I personally like the asynchronous HTTP implementation of node.js --- it saves you all the work mentioned in the original paragraph above. Of course, today there are also a lot of modules readily available for the other components that you need in your spider. Note, however, that the quality of third-party modules may vary considerably. You have to check out whatever you use. [Aging info:] Recently, I wrote a spider using node.js and I found the reliability of npm modules for HTML processing for link and data extraction insufficient. For this job, I "outsourced" this processing to a process written in another programming language. But things are changing quickly and by the time you read this comment, this problem may already be a thing of the past.

Partitioning the work over several servers

One computer can't keep up with spidering the whole WWW. You need to distribute your work over several servers and exchange information between them. I suggest assigning certain "ranges of domain names" to each server: keep a central database of domain names with a reference to a spider computer.
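One possible reading of "ranges of domain names" is an alphabetical split with fixed boundaries. The Python sketch below assumes that reading: the class name, the boundary strings, and the server names are all hypothetical, and an in-memory table stands in for the central database.

```python
import bisect


class DomainRangeTable:
    """Maps a domain name to the spider server responsible for its range."""

    def __init__(self, boundaries, servers):
        # boundaries[i] is the first domain name handled by servers[i + 1];
        # everything sorting before boundaries[0] goes to servers[0].
        if len(servers) != len(boundaries) + 1:
            raise ValueError("need exactly one more server than boundaries")
        self.boundaries = sorted(boundaries)
        self.servers = list(servers)

    def server_for(self, domain):
        # Binary search for the range that contains this domain name.
        index = bisect.bisect_right(self.boundaries, domain.lower())
        return self.servers[index]
```

With boundaries `["h", "q"]` and three servers, names before "h" go to the first server, names from "h" up to "q" to the second, and the rest to the third. An alphabetical split will not balance load evenly, so a real table would likely be adjusted as the crawl discovers how large each range is.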

Extract URLs from received web pages in batches: sort them according to their domain names, remove duplicates, and send them to the responsible spider computer. On that computer, keep an index of URLs that have already been fetched and fetch the remaining URLs.
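The batching step might look like the following Python sketch. The function names are invented for the example, a plain `set` stands in for the index of fetched URLs, and dropping the `#fragment` part before comparing URLs is an added assumption, not something stated above.

```python
from collections import defaultdict
from urllib.parse import urldefrag, urlsplit


def batch_by_domain(urls):
    """Group extracted URLs by host name, dropping duplicates."""
    batches = defaultdict(set)
    for url in urls:
        url, _fragment = urldefrag(url)   # assumption: "#section" is irrelevant
        host = urlsplit(url).hostname
        if host:                          # skips mailto:, relative links, etc.
            batches[host].add(url)
    # Sorted output keeps batches reproducible and easy to merge.
    return {host: sorted(found) for host, found in sorted(batches.items())}


def urls_to_fetch(batch, already_fetched):
    """On the responsible spider: keep only URLs not seen before."""
    fresh = [url for url in batch if url not in already_fetched]
    already_fetched.update(fresh)         # marked as soon as they are scheduled
    return fresh
```

Each per-host batch would then be sent to whichever spider computer is responsible for that host. Note that this sketch marks a URL as fetched when it is scheduled, so a failed download would need to be removed from the index again.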

If you keep a queue of URLs waiting to be fetched on each spider computer, you will have no performance bottlenecks. But it's quite a lot of programming to implement this.

Read the standards

I mentioned several standards (HTTP/1.x, robots.txt, cookies). Take your time to read them and implement them. If you just follow examples of sites that you know, you will make mistakes (forgetting parts of the standard that are not relevant to your samples) and cause trouble for those sites that use these additional features.

It's a pain to read the HTTP/1.1 standard document. But all the little details got added to it because somebody really needs that little detail and now uses it.

Yaakov Belch answered Sep 28 '22 19:09