I am trying to mirror a website at the moment. wget seems to do the job very well; however, it's not working on some pages.
Looking at the manual, the command
wget -r https://www.gnu.org/
should download the GNU page, and it actually does. However, if I use another page, for example the start page of my personal website, this doesn't work anymore.
wget -r https://my-personal.website
The index.html is downloaded, but none of the CSS/JS, let alone any recursive download. All that ends up on disk is the index.html.
I've tried setting the User-Agent using the -U option, but that didn't help either. Is there an option I'm missing that causes wget to stop after the index.html?
UPDATE: I've also tried the --mirror option, which doesn't work either and shows the same behavior.
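If I read the manual correctly, --mirror is just shorthand for -r -N -l inf --no-remove-listing, so the following should have been equivalent to the -r command above:
wget -r -N -l inf --no-remove-listing https://my-personal.website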
Type the following command:
wget --mirror --convert-links --adjust-extension --page-requisites --no-parent https://site-to-download.com
Replace the https://site-to-download.com portion with the URL of the site you actually want to mirror. You are done!
wget can follow links in HTML and XHTML pages and create local versions of remote web sites, fully recreating the directory structure of the original site. This is sometimes referred to as recursive downloading.
Essentially, --no-parent is similar to -I /~luzer/my-archive, only it handles redirections in a more intelligent fashion. Note that, for HTTP (and HTTPS), the trailing slash is very important to --no-parent.
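As an illustration, something like the following (the /blog/ path is just a placeholder) would recurse within that subtree without ever ascending to the parent directory:
wget -r --no-parent https://my-personal.website/blog/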
This means that Wget first downloads the requested document, then the documents linked from that document, then the documents linked by them, and so on. In other words, Wget first downloads the documents at depth 1, then those at depth 2, and so on until the specified maximum depth.
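The maximum depth is controlled with -l (the default is 5); for example, to stop after two levels of links:
wget -r -l 2 https://www.gnu.org/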
Your website uses a lesser-known form of robots control, through the <meta> tag in HTML. You can read more about it here. Wget correctly adheres to the instructions in this robots directive. You can see this happening if you look a little closely at the debug output of Wget (run with -d) when trying to recursively download the website:
no-follow in my-personal.website/index.html: 1
Now, unfortunately, that's not a very helpful message unless you're one of the developers and know the codebase. I will try to update the message to be a little clearer in this case, the same way we do when this happens because of a robots.txt file.
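For reference, the directive that triggers this behaviour is a tag like the following in the page's <head> (the exact attribute values on your page are my assumption):
<meta name="robots" content="nofollow">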
Anyway, the fix is simple: disable robots parsing. While this is fine when accessing your own website, please be mindful of other people's web servers when doing this elsewhere. The full command you need is:
$ wget -r -erobots=off https://my-personal.website
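If you also want the page requisites (CSS/JS) and rewritten links mentioned in the other answer, you can combine those mirroring options with the robots override; a sketch, assuming the same URL:
$ wget --mirror --page-requisites --convert-links --adjust-extension -e robots=off https://my-personal.website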
EDIT: As promised, I've added an improved message. See here. It now prints:
no-follow attribute found in my-personal.website/index.html. Will not follow any links on this page