I am trying to mirror a website at the moment. wget seems to do the job very well; however, it's not working on some pages.
Looking at the manual, the command
wget -r https://www.gnu.org/
should download the GNU page, and it actually does. However, if I use another page, for example the start page of my personal website, this doesn't work anymore.
wget -r https://my-personal.website
The index.html is downloaded, but none of the CSS/JS, let alone any recursive download. All that ends up on disk is the index.html.
I've tried setting the User-Agent using the -U option, but that didn't help either. Is there an option I'm missing that causes wget to stop after the index.html?
UPDATE: I've also tried the --mirror option, which doesn't work either and shows the same behavior.
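If I read the manual correctly, --mirror is just shorthand for -r -N -l inf --no-remove-listing, so the following should have been equivalent to the -r command above:
wget -r -N -l inf --no-remove-listing https://my-personal.website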
Type the following command:
wget --mirror --convert-links --adjust-extension --page-requisites --no-parent https://site-to-download.com
Replace the https://site-to-download.com portion with the URL of the site you actually want to mirror. You are done!
wget can follow links in HTML and XHTML pages and create local versions of remote web sites, fully recreating the directory structure of the original site. This is sometimes referred to as recursive downloading.
Essentially, --no-parent is similar to -I /~luzer/my-archive, only it handles redirections in a more intelligent fashion. Note that, for HTTP (and HTTPS), the trailing slash is very important to --no-parent.
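As an illustration, something like the following (the /blog/ path is just a placeholder) would recurse within that subtree without ever ascending to the parent directory:
wget -r --no-parent https://my-personal.website/blog/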
This means that Wget first downloads the requested document, then the documents linked from that document, then the documents linked by them, and so on. In other words, Wget first downloads the documents at depth 1, then those at depth 2, and so on until the specified maximum depth.
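The maximum depth is controlled with -l (the default is 5); for example, to stop after two levels of links:
wget -r -l 2 https://www.gnu.org/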
Your website uses a lesser-known form of robots control, through the <meta> tag in HTML. You can read more about it here. Wget correctly adheres to the instructions in this robots directive. You can see this happening if you look a little closely at the debug output of Wget (run with -d) when trying to recursively download the website:
no-follow in my-personal.website/index.html: 1
Now, unfortunately, that's not a very helpful message unless you're one of the developers and know the codebase. I will try to update the message to be a little clearer in this case, the same way we do when this happens because of a robots.txt file.
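For reference, the directive that triggers this behaviour is a tag like the following in the page's <head> (the exact attribute values on your page are my assumption):
<meta name="robots" content="nofollow">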
Anyway, the fix is simple: disable robots parsing. While this is fine when accessing your own website, please be mindful of other people's web servers when doing this elsewhere. The full command you need is:
$ wget -r -erobots=off https://my-personal.website
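If you also want the page requisites (CSS/JS) and rewritten links mentioned in the other answer, you can combine those mirroring options with the robots override; a sketch, assuming the same URL:
$ wget --mirror --page-requisites --convert-links --adjust-extension -e robots=off https://my-personal.website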
EDIT: As promised, I've added an improved message. See here. It now prints:
no-follow attribute found in my-personal.website/index.html. Will not follow any links on this page