Checking for dead links locally in a static website (using wget?)

A very nice tool to check for dead links (e.g. links pointing to 404 errors) is wget --spider. However, I have a slightly different use-case where I generate a static website, and want to check for broken links before uploading. More precisely, I want to check both:

  • Relative links like <a href="some/file.pdf">file.pdf</a>

  • Absolute links, most likely to external sites like <a href="http://example.com">example</a>.

I tried wget --spider --force-html -i file-to-check.html, which reads the local file, treats it as HTML and follows each link. Unfortunately, it can't deal with relative links within the local HTML file (it errors out with Cannot resolve incomplete link some/file.pdf). I tried using file:// URLs, but wget does not support them.

Currently, I have a hack based on running a local webserver through python3 -m http.server and checking the local files over HTTP:

python3 -m http.server &
pid=$! 
sleep .5
error=0
wget --spider -nd -nv -H -r -l 1 http://localhost:8000/index.html || error=$? 
kill $pid
wait $pid
exit $error

I'm not really happy with this for several reasons:

  • I need this sleep .5 to wait for the webserver to be ready. Without it, the script fails, but I can't guarantee that 0.5 seconds will be enough. I'd prefer having a way to start the wget command when the server is ready.

  • Similarly, this kill $pid at the end feels ugly.

Ideally, python3 -m http.server would have an option to run a command once the server is ready and to shut itself down after the command completes. That sounds doable by writing a bit of Python, but I was wondering whether a cleaner solution exists.
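
For reference, here is roughly what that bit of Python could look like. It is only a sketch: the port, the index.html entry point and the wget options are simply carried over from my hack above.

#!/usr/bin/env python3
# Rough sketch: serve the current directory, run the link checker,
# then shut the server down. Port and wget options are assumptions
# carried over from the shell hack above.
import http.server
import socketserver
import subprocess
import sys
import threading

PORT = 8000  # same port as in the shell hack

with socketserver.TCPServer(("", PORT), http.server.SimpleHTTPRequestHandler) as httpd:
    # The socket is already bound and listening once the constructor
    # returns, so no arbitrary sleep is needed before running the checker.
    server_thread = threading.Thread(target=httpd.serve_forever, daemon=True)
    server_thread.start()

    result = subprocess.run(
        ["wget", "--spider", "-nd", "-nv", "-H", "-r", "-l", "1",
         f"http://localhost:{PORT}/index.html"]
    )

    httpd.shutdown()      # stop serve_forever()
    server_thread.join()

sys.exit(result.returncode)

This gets rid of both the sleep and the explicit kill, but it means maintaining a small wrapper script, which is what I was hoping to avoid.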

Did I miss anything? Is there a better solution? I'm mentioning wget in my question because it does almost what I want, but using wget is not a requirement for me (nor is python -m http.server). I just need to have something easy to run and automate on Linux.

Asked Mar 14 '18 by Matthieu Moy


1 Answer

I think you are heading in the right direction. I would use wget and Python, since they are readily available on many systems, and together they get the job done. What you want is to listen for Serving HTTP on 0.0.0.0 on the stdout of the server process.

So I would start the process with something like this:

python3 -u -m http.server > ./myserver.log &

Note the -u for unbuffered output; this is really important, because otherwise Python buffers its stdout when it is redirected to a file and the line may not show up in the log in time.

The next step is to wait for that text to appear in myserver.log:

timeout 10 awk '/Serving HTTP on 0.0.0.0/{print; exit}' <(tail -f ./myserver.log)

Here 10 seconds is your maximum wait time; the rest is self-explanatory. Next, about your kill $pid: I don't think it is a problem, but if you want it to be closer to the way a user would stop the server, I would change it to

kill -s SIGINT $pid

This is equivalent to pressing CTRL+C after launching the program. I would also handle SIGINT in the bash script itself, using something like the approach from this answer:

https://unix.stackexchange.com/questions/313644/execute-command-or-function-when-sigint-or-sigterm-is-send-to-the-parent-script/313648

That answer basically adds the following to the top of the bash script, so that killing the script with CTRL+C or an external kill signal is handled cleanly:

#!/bin/bash
exit_script() {
    echo "Printing something special!"
    echo "Maybe executing other commands!"
    trap - SIGINT SIGTERM # clear the trap
    kill -- -$$ # Sends SIGTERM to child/sub processes
}

trap exit_script SIGINT SIGTERM
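
Putting these pieces together, your check script could then look something like the sketch below. The port, index.html and the wget options are just taken from your question, so adjust them as needed.

#!/bin/bash
# Sketch combining the pieces above; port, index.html and the wget
# options come from the question and may need adjusting.
exit_script() {
    trap - SIGINT SIGTERM   # clear the trap
    kill -- -$$             # forward the signal to child/sub processes
}
trap exit_script SIGINT SIGTERM

python3 -u -m http.server > ./myserver.log &
pid=$!

# Block until the server reports it is listening (10 seconds at most).
timeout 10 awk '/Serving HTTP on 0.0.0.0/{print; exit}' <(tail -f ./myserver.log)

error=0
wget --spider -nd -nv -H -r -l 1 http://localhost:8000/index.html || error=$?

kill -s SIGINT "$pid"   # like pressing CTRL+C in the server's terminal
wait "$pid"
exit "$error"

The trap is mainly there so that interrupting the script by hand also takes the background server (and the tail started by the process substitution) down with it.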
Answered Nov 16 '22 by Tarun Lalwani