
Wget: Skip download if file already exists?

Tags:

wget

Answers to "Skip download if files exist in wget?" say to use -nc, or --no-clobber, but -nc doesn't prevent wget from sending the HTTP request and downloading the file. It just doesn't do anything with the download if the file has already been fully retrieved. Is there any way to prevent the HTTP request from being made at all if the file already exists?

I installed wget 1.16.3 with Homebrew. After running the command below, for each file that already existed wget reported something like making an HTTP request, appeared to download the file, and then said something like: file already retrieved, nothing to do.

wget --user-agent='Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_4) AppleWebKit/600.7.12 (KHTML, like Gecko) Version/8.0.7 Safari/600.7.12' \
     --tries=1 \
     --no-clobber \
     --continue \
     --wait=0.3 \
     --random-wait \
     --adjust-extension \
     --load-cookies cookies.txt \
     --save-cookies cookies.txt \
     --keep-session-cookies \
     --recursive \
     --level=inf \
     --convert-links \
     --page-requisites \
     --reject=edit,logout,rate \
     --domains=example.com,s3.amazonaws.com \
     --span-hosts \
     --exclude-directories=/admin \
     http://example.com/
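For a one-off, non-recursive download, a plain shell test avoids invoking wget at all when the file is already present (the file name below is only an example); it doesn't help with a recursive mirror like the command above, though:

$ [ -f index.html ] || wget http://example.com/index.html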
asked Oct 18 '15 by ma11hew28


2 Answers

The -nc option does what you're asking for, at least in wget 1.19.1.
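(The question mentions wget 1.16.3 from Homebrew; if that's what you're running, upgrading should get you a version where this behaves as shown below. The commands assume Homebrew is already set up.)

$ brew upgrade wget
$ wget --version | head -1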


On my server, I have a file called index.html which contains links to a.html and b.html.

$ wget -r -nc http://127.0.0.1:8000/

Server logs show this:

127.0.0.1 - - [23/Mar/2017 17:51:25] "GET / HTTP/1.1" 200 -
127.0.0.1 - - [23/Mar/2017 17:51:25] "GET /robots.txt HTTP/1.1" 404 -
127.0.0.1 - - [23/Mar/2017 17:51:25] "GET /a.html HTTP/1.1" 200 -
127.0.0.1 - - [23/Mar/2017 17:51:25] "GET /b.html HTTP/1.1" 200 -

Now I remove b.html and run it again:

$ rm 127.0.0.1\:8000/b.html
$ wget -r -nc http://127.0.0.1:8000/

Server logs show this:

127.0.0.1 - - [23/Mar/2017 17:51:38] "GET /robots.txt HTTP/1.1" 404 -
127.0.0.1 - - [23/Mar/2017 17:51:38] "GET /b.html HTTP/1.1" 200 -

As you can see, apart from robots.txt, the only request made was for b.html; the files that already existed locally were not requested again.
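If you want to reproduce this locally, a minimal setup along these lines should work (it assumes Python 3's built-in http.server, which logs requests to stdout in the format shown above; the page contents are just placeholders). Run the wget command above against http://127.0.0.1:8000/ from another terminal.

$ mkdir site && cd site
$ printf '<a href="a.html">a</a>\n<a href="b.html">b</a>\n' > index.html
$ echo "page a" > a.html
$ echo "page b" > b.html
$ python3 -m http.server 8000    # serves the current directory on port 8000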

answered Oct 19 '22 by Snowball


It appears you are using incompatible options. I get the following warning with wget 1.16 on Linux:

$ wget --no-clobber --convert-links http://example.com
Both --no-clobber and --convert-links were specified, only --convert-links will be used.
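If skipping files you already have matters more than rewriting links, one workaround (a sketch based on the warning above, not tested against your site) is to drop --convert-links so that --no-clobber actually takes effect:

$ wget --no-clobber --recursive --page-requisites http://example.com/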
answered Oct 19 '22 by a guest