Why does wget only download the index.html for some websites?

Tags:

wget

I'm trying to use the wget command:

wget -p http://www.example.com

to fetch all the files on the main page. For some websites it works, but in most cases it only downloads index.html. I've tried wget -r, but it doesn't work. Does anyone know how to fetch all the files on a page, or at least get a list of the files and their corresponding URLs on the page?

306

asked Jun 20 '12 16:06

Jay H

2 Answers

Wget can also download an entire website. Because this can put a heavy load on the server, wget obeys the site's robots.txt file by default.

 wget -r -p http://www.example.com

The -p parameter tells wget to download all the files a page needs to display properly, including images, so that the saved HTML pages look the way they should.

So what if you don't want wget to obey the robots.txt file? You can simply add -e robots=off to the command, like this:

 wget -r -p -e robots=off http://www.example.com

Many sites will not let you download the entire site, and they check your browser's identity (the User-Agent string) to enforce this. To get around it, use -U mozilla so that wget sends a browser-like User-Agent string:

 wget -r -p -e robots=off -U mozilla http://www.example.com

A lot of website owners will not like the fact that you are downloading their entire site. If the server sees that you are downloading a large number of files, it may automatically add you to its blacklist. The way around this is to wait a few seconds after every download. To do this with wget, include --wait=X (where X is the number of seconds).

You can also use the --random-wait parameter to let wget vary the delay between requests at random (according to the wget manual, between 0.5 and 1.5 times the --wait value). To include this in the command:

wget --random-wait -r -p -e robots=off -U mozilla http://www.example.com
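The two waiting options can be combined in one call. Here is a minimal sketch, assuming a hypothetical helper function named polite_mirror and a hypothetical DRY_RUN variable (neither is part of wget; they are only for illustration). With DRY_RUN set, the function prints the command instead of contacting the server, so you can review it first.

```shell
#!/bin/sh
# polite_mirror URL [WAIT]: hypothetical helper, not part of wget.
# Recursively fetches URL with a randomized delay based on WAIT seconds
# (WAIT defaults to 2 here, an arbitrary choice for illustration).
polite_mirror() {
    url=$1
    wait_seconds=${2:-2}
    # If DRY_RUN is set and non-empty, "echo" is prepended, so the command
    # is printed rather than executed.
    ${DRY_RUN:+echo} wget --wait="$wait_seconds" --random-wait \
        -r -p -e robots=off -U mozilla "$url"
}

# Print the command that would be run, without touching the network.
DRY_RUN=1 polite_mirror http://www.example.com 2
```

Calling polite_mirror without DRY_RUN runs the actual download. A larger WAIT value is gentler on the server but makes the download take longer.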

175

answered Sep 26 '22 19:09

Ritesh Chandora

Firstly, to clarify the question, the aim is to download index.html plus all the requisite parts of that page (images, etc). The -p option is equivalent to --page-requisites.

The reason the page requisites are not always downloaded is that they are often hosted on a different domain from the original page (a CDN, for example). By default, wget refuses to visit other hosts, so you need to enable host spanning with the --span-hosts option.

wget --page-requisites --span-hosts 'http://www.amazon.com/'
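To check whether this is why only index.html was saved, one rough approach is to list the hosts that the saved page refers to. The sketch below assumes a hypothetical helper named list_hosts and a made-up sample page; it uses a simple pattern match rather than a real HTML parser, so it only catches absolute URLs in double-quoted src and href attributes.

```shell
#!/bin/sh
# list_hosts FILE: hypothetical helper for illustration.
# Prints the distinct hosts found in double-quoted src="..." and href="..."
# attributes that hold absolute http(s) URLs. Relative URLs are ignored.
list_hosts() {
    grep -Eo '(src|href)="https?://[^/"]+' "$1" |
        sed -E 's|^.*://||' |
        sort -u
}

# Demo on a made-up page; in practice, pass the index.html that wget saved.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<img src="http://cdn.example.net/logo.png">
<link rel="stylesheet" href="https://static.example.org/site.css">
<a href="/about.html">About</a>
EOF
list_hosts "$sample"
rm -f "$sample"
```

For the sample page this prints cdn.example.net and static.example.org. If the hosts listed for a real page differ from the host you passed to wget, the requisites live elsewhere and --span-hosts is likely needed.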

If you need to be able to load index.html and have all the page requisites load from the local version, you'll need to add the --convert-links option, so that URLs in img src attributes (for example) are rewritten to relative URLs pointing to the local versions.

Optionally, you might also want to save all the files under a single "host" directory by adding the --no-host-directories option, or save all the files in a single, flat directory by adding the --no-directories option.

Using --no-directories will result in lots of files being downloaded to the current directory, so you probably want to specify a folder name for the output files, using --directory-prefix.

wget --page-requisites --span-hosts --convert-links --no-directories --directory-prefix=output 'http://www.amazon.com/'

answered Sep 22 '22 19:09

Alf Eaton
