
mirror http website, excluding certain files

Tags:

wget

I'd like to mirror a simple password-protected web portal so that I have a local, up-to-date copy of its data. Essentially this website is just a directory listing with data organised into folders, and I don't really care about keeping the HTML files and other formatting elements. However, there are some huge file types that are too large to download, so I want to ignore these.

Using wget -m with the -R/--reject flag nearly does what I want, except that every file gets downloaded first and is only deleted afterwards if it matches the -R patterns.

Here's how I'm using wget:

wget --http-user userName --http-password password -R index.html,*tiff,*bam,*bai -m http://web.server.org/

Which produces output like this, confirming that an excluded file (index.html) (a) gets downloaded, and (b) then gets deleted:

...
--2012-05-23 09:38:38-- http://web.server.org/folder/
Reusing existing connection to web.server.org:80.
HTTP request sent, awaiting response... 401 Authorization Required
Reusing existing connection to web.server.org:80.
HTTP request sent, awaiting response... 200 OK
Length: 2677 (2.6K) [text/html]
Saving to: `web.server.org/folder/index.html'
100%[======================================================================================================================>] 2,677 --.-K/s in 0s

Last-modified header missing -- time-stamps turned off.
2012-05-23 09:38:39 (328 MB/s) - `web.server.org/folder/index.html' saved [2677/2677]

Removing web.server.org/folder/index.html since it should be rejected.

...

Is there a way to force wget to reject the file before downloading it?
Is there an alternative that I should consider?

Also, why do I get a 401 Authorization Required response for every downloaded file, despite supplying a username and password? It's as if wget tries to connect unauthenticated every time before retrying with the credentials.
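(Aside: that 401-then-200 pair is how challenge-based HTTP authentication normally looks: the client first sends the request without credentials, the server answers 401 with a challenge, and the client retries with credentials attached. If the extra round-trip per file is a concern, wget has an --auth-no-challenge option that sends Basic credentials preemptively. A sketch, reusing the placeholder credentials from the command above; only use this against a server you trust, since the credentials are sent unconditionally:)

```shell
# Send HTTP Basic credentials on the first request instead of waiting
# for the server's 401 challenge:
wget --auth-no-challenge \
     --http-user userName --http-password password \
     -R 'index.html,*tiff,*bam,*bai' -m \
     http://web.server.org/
```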

thanks, Mark

asked May 23 '12 by drmjc

2 Answers

Pavuk (http://www.pavuk.org) looked like a promising alternative: it can mirror websites while excluding files based on URL patterns and filename extensions. However, pavuk 0.9.35 segfaults/dies randomly in the middle of long transfers and does not appear to be actively developed (that version was built in Nov 2008).

FYI, here's how I was using it:

pavuk -mode mirror -force_reget -preserve_time -progress -Robots \
  -auth_scheme 3 -auth_name x -auth_passwd x \
  -dsfx 'html,bam,bai,tiff,jpg' -dont_leave_site -remove_old \
  -cdir /path/to/root -subdir /path/to/root \
  -skip_url_pattern '*icons*' -skip_url_pattern '*styles*' \
  -skip_url_pattern '*images*' -skip_url_pattern '*bam*' \
  -skip_url_pattern '*solidstats*' \
  http://web.server.org/folder 2>&1 | tee pavuk-date.log

In the end, wget's --exclude-directories did the trick:

wget --mirror --continue --progress=dot:mega --no-parent \
--no-host-directories --cut-dirs=1 \
--http-user x --http-password x \
--exclude-directories='folder/*/folder_containing_large_data*' --reject "index.html*" \
--directory-prefix /path/to/local/mirror \
http://my.server.org/folder

Since the --exclude-directories wildcards don't span '/', you need to form your patterns quite specifically to avoid downloading entire folders.

Mark

answered Nov 09 '22 by drmjc

The --reject 'pattern' parameter actually worked for me with wget 1.14.

For example:

wget --reject rpm http://somerpmmirror.org/site/

None of the *.rpm files were downloaded at all; only the index pages were.

Warning: file patterns can be unintentionally expanded by bash if they match a file in the working directory. Use quotes to avoid that:

touch blahblah.rpm
# working -- single quotes stop the shell from expanding the glob
wget -R '*.rpm' ....
# working -- double quotes also prevent glob expansion
wget -R "*.rpm" ....
# not working -- the shell expands *.rpm to blahblah.rpm before wget sees it
wget -R *.rpm ....
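(You can see exactly what the shell hands to wget by substituting echo for wget, in a scratch directory containing a matching file as above:)

```shell
cd "$(mktemp -d)"    # empty scratch directory
touch blahblah.rpm
echo -R *.rpm        # unquoted: glob expands, prints: -R blahblah.rpm
echo -R '*.rpm'      # quoted: pattern survives, prints: -R *.rpm
```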

answered Nov 09 '22 by radzimir