 

wget recursive/mirror option not following links

Tags:

shell

wget

I am trying to mirror a website at the moment. wget seems well suited to the job; however, it is not working on some pages.

Looking at the manual, the command

wget -r https://www.gnu.org/

should download the GNU page, and it actually does. However, with another page, for example the start page of my personal website, this doesn't work anymore.

wget -r https://my-personal.website

The index.html is downloaded, but none of the CSS/JS files come with it, let alone any recursively linked pages. All that is downloaded is the index.html.

I've tried setting the User-Agent using the -U option, but that didn't help either. Is there an option missing that is causing wget to stop after the index.html?

UPDATE: I've also tried the --mirror option, which shows the same behavior.

asked Feb 19 '19 by nachtjasmin


People also ask

How do I mirror a website using wget?

Use the following command: wget --mirror --convert-links --adjust-extension --page-requisites --no-parent https://site-to-download.com. Replace the https://site-to-download.com portion with the URL of the site you want to mirror. You are done!
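Spelled out with comments (the flag descriptions follow the wget manual; the URL is a placeholder):

```shell
# Placeholder URL - replace with the site you actually want to mirror.
url="https://site-to-download.com"

# --mirror            shorthand for -r -N -l inf --no-remove-listing
# --convert-links     rewrite links so the local copy is browsable offline
# --adjust-extension  add .html to pages served without an extension
# --page-requisites   also fetch the CSS, JS, and images each page needs
# --no-parent         never ascend above the starting directory
wget --mirror --convert-links --adjust-extension \
     --page-requisites --no-parent "$url"
```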

What is wget recursive?

wget can follow links in HTML and XHTML pages and create local versions of remote web sites, fully recreating the directory structure of the original site. This is sometimes referred to as recursive downloading.

What is --no-parent in wget?

Essentially, ' --no-parent ' is similar to ' -I/~luzer/my-archive ', only it handles redirections in a more intelligent fashion. Note that, for HTTP (and HTTPS), the trailing slash is very important to ' --no-parent '.

What are recursive downloads?

This means that Wget first downloads the requested document, then the documents linked from that document, then the documents linked by them, and so on. In other words, Wget first downloads the documents at depth 1, then those at depth 2, and so on until the specified maximum depth.
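The depth can be capped with the -l option; a sketch using the GNU URL from the question (wget's default maximum depth is 5, and -l inf removes the limit):

```shell
# Follow links at most two levels deep below the start page.
wget -r -l 2 https://www.gnu.org/
```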


1 Answer

Your website uses a lesser-known form of robots control: the <meta name="robots"> tag in HTML. You can read more about it here. Wget correctly adheres to the instructions in this directive. You can see this happening if you look closely at Wget's debug output when trying to recursively download the website:

no-follow in my-personal.website/index.html: 1

Now, unfortunately, that's not a very helpful message unless you're one of the developers and know the codebase. I will try to update it to something clearer, the way we already do when this happens because of a robots.txt file.
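You can also confirm the cause yourself without reading debug output, by searching the downloaded page for the directive. A sketch, with a sample index.html inlined for illustration; in practice, grep the file wget actually saved:

```shell
# Sample page standing in for the real downloaded index.html.
cat > /tmp/index.html <<'EOF'
<html><head>
<meta name="robots" content="nofollow">
</head>
<body><a href="/page2.html">next page</a></body></html>
EOF

# Case-insensitive search for a nofollow robots directive.
grep -io 'content="[^"]*nofollow[^"]*"' /tmp/index.html
# prints: content="nofollow"
```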

Anyway, the fix is simple: disable robots parsing. While this is fine when accessing your own website, please be considerate of other people's web servers when doing it elsewhere. The full command you need is:

$ wget -r -erobots=off https://my-personal.website
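If you do apply this to a site you don't own, you can combine it with wget's throttling options to go easy on the server; a sketch:

```shell
# -e robots=off   ignore robots.txt and <meta name="robots"> directives
# --wait=1        pause one second between requests
# --random-wait   vary that pause to further lighten the server load
wget -r -e robots=off --wait=1 --random-wait https://my-personal.website
```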

EDIT: As promised, I've added an improved message. See here. It now prints:

no-follow attribute found in my-personal.website/index.html. Will not follow any links on this page

answered Sep 30 '22 by darnir