 

WGet's Logic in Order of Downloading

Tags:

bash

wget

This is a fairly general question, but one with wider implications for a data mining project I'm running. I have been using wget to mirror archival web pages for analysis. This is a large amount of data, and my current mirroring process has been running for almost a week, which has given me a lot of time to watch the readout.

How does wget determine the order in which it downloads pages? I can't discern a consistent logic to its decision-making process (it's not proceeding alphabetically, by date of original site creation, or by file type). As I begin to work with the data, understanding this would be very helpful.

FWIW, here is the command that I'm using (the site required cookies, and while its TOS do allow access 'by any means', I don't want to take any chances), where SITE stands for the URL:

wget -m --cookies=on --keep-session-cookies --load-cookies=cookie3.txt --save-cookies=cookie4.txt --referer=SITE --random-wait --wait=1 --limit-rate=30K --user-agent="Mozilla 4.0" SITE
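
For clarity, here is the same command with each option annotated (the comments are my glosses on the wget manual; SITE remains a placeholder):

# Same command, one option per line so each can be commented:
opts=(
  -m                            # --mirror: implies -r -N -l inf --no-remove-listing
  --cookies=on                  # cookie handling on (wget's default, made explicit)
  --keep-session-cookies        # also save session cookies, normally discarded
  --load-cookies=cookie3.txt    # read cookies captured from a prior login
  --save-cookies=cookie4.txt    # write the updated cookie jar on exit
  --referer=SITE                # send a Referer header with each request
  --random-wait                 # vary the delay between 0.5x and 1.5x of --wait
  --wait=1                      # base delay of 1 second between requests
  --limit-rate=30K              # cap download bandwidth at about 30 KB/s
  --user-agent="Mozilla 4.0"    # identify as a browser rather than wget
)
wget "${opts[@]}" SITE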

Edited to add: in comments on chown's helpful answer, I refined my question a bit, so here it is. With larger sites - say epe.lac-bac.gc.ca/100/205/301/ic/cdc/E/Alphabet.asp - I find that wget initially creates a directory structure and some of the index.html/default.html pages, but then goes back through the disparate websites a few more times, grabbing a few more images and sub-pages on each pass.

asked Oct 15 '11 by canadian_scholar


1 Answer

From the GNU wget manual on gnu.org, under Recursive Download:


GNU Wget is capable of traversing parts of the Web (or a single HTTP or FTP server), following links and directory structure. We refer to this as recursive retrieval, or recursion.

With HTTP URLs, Wget retrieves and parses the HTML or CSS from the given URL, retrieving the files the document refers to, through markup like href or src, or CSS URI values specified using the ‘url()’ functional notation. If the freshly downloaded file is also of type text/html, application/xhtml+xml, or text/css, it will be parsed and followed further.

Recursive retrieval of HTTP and HTML/CSS content is breadth-first. This means that Wget first downloads the requested document, then the documents linked from that document, then the documents linked by them, and so on. In other words, Wget first downloads the documents at depth 1, then those at depth 2, and so on until the specified maximum depth.

The maximum depth to which the retrieval may descend is specified with the ‘-l’ option. The default maximum depth is five layers.

When retrieving an FTP URL recursively, Wget will retrieve all the data from the given directory tree (including the subdirectories up to the specified depth) on the remote server, creating its mirror image locally. FTP retrieval is also limited by the depth parameter. Unlike HTTP recursion, FTP recursion is performed depth-first.

By default, Wget will create a local directory tree, corresponding to the one found on the remote server.

.... snip ....

Recursive retrieval should be used with care. Don't say you were not warned.
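
So the order you are seeing is that breadth-first queue at work. Two things follow from the quoted passage. First, the ‘-l’ flag bounds the recursion; for instance, to stop at two levels instead of the default five (example.com here is just a placeholder):

# Recurse two levels deep instead of the default five:
wget -r -l 2 http://example.com/

Second, to make the breadth-first ordering concrete, here is a toy sketch of the idea in bash. This is not wget's actual implementation - the curl/grep link extraction is crude, relative links are not resolved, and extract_links is a made-up helper - it only shows the FIFO queue that yields "everything at depth 1 before anything at depth 2":

#!/usr/bin/env bash
# Toy breadth-first crawler: a FIFO queue of "url depth" pairs.
max_depth=5                           # same as wget's default -l
queue=("http://example.com/ 0")       # seed URL at depth 0
declare -A seen                       # URLs already fetched

extract_links() {
    # Hypothetical helper: naively pull href values out of a page.
    curl -s "$1" | grep -o 'href="[^"]*"' | sed 's/^href="//; s/"$//'
}

while ((${#queue[@]})); do
    read -r url depth <<< "${queue[0]}"
    queue=("${queue[@]:1}")           # dequeue from the FRONT (FIFO = breadth-first)
    [[ -n ${seen[$url]} ]] && continue
    seen[$url]=1
    wget -q -x "$url"                 # fetch this page before anything it links to
    ((depth < max_depth)) || continue
    for link in $(extract_links "$url"); do
        queue+=("$link $((depth + 1))")   # newly found links go to the BACK
    done
done

Within a single page, the queue is filled in the order the links appear in the markup, which is what the test below demonstrates.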


From my own very basic testing, wget goes in order of appearance, from the top to the bottom of the page, when the structure depth is 1:

[ 16:28 root@host /var/www/html ]# cat index.html
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html lang="en-US">
    <head>
        <link rel="stylesheet" type="text/css" href="style.css">
    </head>
    <body>
        <div style="text-align:center;">
            <h2>Mobile Test Page</h2>
        </div>
        <a href="/c.htm">c</a>
        <a href="/a.htm">a</a>
        <a href="/b.htm">b</a>
    </body>
</html>



[ 16:28 jon@host ~ ]$ wget -m http://98.164.214.224:8000
--2011-10-15 16:28:51--  http://98.164.214.224:8000/
Connecting to 98.164.214.224:8000... connected.
HTTP request sent, awaiting response... 200 OK
Length: 556 [text/html]
Saving to: "98.164.214.224:8000/index.html"

100%[====================================================================================================================================================================================================>] 556         --.-K/s   in 0s

2011-10-15 16:28:51 (19.7 MB/s) - "98.164.214.224:8000/index.html" saved [556/556]

--2011-10-15 16:28:51--  http://98.164.214.224:8000/style.css
Connecting to 98.164.214.224:8000... connected.
HTTP request sent, awaiting response... 200 OK
Length: 221 [text/css]
Saving to: "98.164.214.224:8000/style.css"

100%[====================================================================================================================================================================================================>] 221         --.-K/s   in 0s

2011-10-15 16:28:51 (777 KB/s) - "98.164.214.224:8000/style.css" saved [221/221]

--2011-10-15 16:28:51--  http://98.164.214.224:8000/c.htm
Connecting to 98.164.214.224:8000... connected.
HTTP request sent, awaiting response... 200 OK
Length: 0 [text/html]
Saving to: "98.164.214.224:8000/c.htm"

    [ <=>                                                                                                                                                                                                 ] 0           --.-K/s   in 0s

2011-10-15 16:28:51 (0.00 B/s) - "98.164.214.224:8000/c.htm" saved [0/0]

--2011-10-15 16:28:51--  http://98.164.214.224:8000/a.htm
Connecting to 98.164.214.224:8000... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2 [text/html]
Saving to: "98.164.214.224:8000/a.htm"

100%[====================================================================================================================================================================================================>] 2           --.-K/s   in 0s

2011-10-15 16:28:51 (102 KB/s) - "98.164.214.224:8000/a.htm" saved [2/2]

--2011-10-15 16:28:51--  http://98.164.214.224:8000/b.htm
Connecting to 98.164.214.224:8000... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2 [text/html]
Saving to: "98.164.214.224:8000/b.htm"

100%[====================================================================================================================================================================================================>] 2           --.-K/s   in 0s

2011-10-15 16:28:51 (85.8 KB/s) - "98.164.214.224:8000/b.htm" saved [2/2]

FINISHED --2011-10-15 16:28:51--
Downloaded: 5 files, 781 in 0s (2.15 MB/s)
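
If you want to watch wget's queue decisions on your own mirror job, most builds will log them when run with --debug (the exact message wording varies by version, so treat the grep pattern as approximate; example.com is again a placeholder):

# Show enqueue/dequeue decisions and the depth of each URL:
wget -r -l 2 --debug http://example.com/ 2>&1 | grep -Ei 'queuing|depth'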
This breadth-first queue also accounts for the multi-pass behaviour described in the question's edit: each apparent "pass" over a site is wget working through the next depth level, picking up the images and sub-pages discovered on the level before.

answered Nov 15 '22 by chown