I'm trying to make a mirror of a website, but the URLs include several paths that overlap when copied to files on disk in the normal wget way. The problem manifests with URLs like http://example.com/news and http://example.com/news/article1.
Wget downloads these URLs as /news and /news/article1, but that means that the /news file is overwritten by a folder with the same name.
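The clash can be reproduced locally without touching the network. The following is only a sketch of the underlying filesystem constraint, not of what wget does internally; the temporary directory and the clash variable are made up for the demonstration.

```shell
# Work in a throwaway directory so nothing real is touched.
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1

# Stand-in for wget saving http://example.com/news as a regular file.
echo "<html>news index</html>" > news

# Saving /news/article1 needs a directory called "news", but a file and a
# directory cannot share one name in the same parent directory.
if mkdir news 2>/dev/null; then
    clash="no"
else
    clash="yes"
fi
echo "name clash: $clash"
```

One of the two has to give way, which is why the page at /news and the pages below it cannot both keep their URL paths as file names.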
A proper static mirror would require that these two URLs be downloaded instead as /news/index.html and /news/article1.
I have tried to work around this problem by running wget twice and moving the files accordingly, but that hasn't worked well for me. The /news page has links to /news/article1 that need to be converted. I'm using the -k option to convert links, but if I run wget twice, it doesn't convert the links between files that were downloaded in separate runs.
Here's my command:
wget -p -r -l4 -k -d -nH http://example.com
Here's an example of the workaround that I've tried:
# wget once at first level (gets /news path but not /news/*)
wget -p -r -l1 -k -nH http://example.com
# move /news file to temp path
mv news /tmp/news.html
# wget again to get everything else (notice the different level value)
wget -p -r -l4 -k -nH http://example.com
# move temp path back to /news/index.html
mv /tmp/news.html news/index.html
In the above example, the links on the /news page that are supposed to point to /news/article1 have not been converted.
Does anybody know how to work around this with wget? Is there a different tool that would work better?
I figured it out!
The problem was my assumption that /news/index.html was the path that I needed. After closely reading the man page, I found that -E (--adjust-extension) solved my problem. This flag makes wget append the .html extension to every HTML file it downloads whose name doesn't already end in it, so the /news page is saved as news.html and no longer collides with the news folder. Coupling that with -k to convert the links results in a 100% usable mirror that has all of the pages needed.
Here's an example map of the downloaded files and paths:
http://example.com/news --> /news.html
http://example.com/news/article1 --> /news/article1.html
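The mapping above can be sketched as a small shell function. url_to_mirror_path is a hypothetical helper, not part of wget; it assumes every URL is an HTML page, that -E and -nH are in effect, and it ignores query strings.

```shell
# Hypothetical helper: predict the local path of an HTML page mirrored
# with -E and -nH. Not part of wget; for illustration only.
url_to_mirror_path() {
    path=${1#*://}                 # strip the scheme, e.g. "http://"
    case $path in
        */*) path=/${path#*/} ;;   # strip the host, keep the path
        *)   path=/ ;;             # bare host with no path
    esac
    case $path in
        */)           printf '%s\n' "${path}index.html" ;;  # directory URL
        *.html|*.htm) printf '%s\n' "$path" ;;              # already has the suffix
        *)            printf '%s\n' "${path}.html" ;;       # suffix added by -E
    esac
}

url_to_mirror_path http://example.com/news            # prints /news.html
url_to_mirror_path http://example.com/news/article1   # prints /news/article1.html
```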
As a functional mirror, this is great. Default webserver configurations (at least for Apache) seem to allow the path http://sitemirror.com/news/article1 to load the /news/article1.html content; this is probably content negotiation (MultiViews) at work, so it may depend on that option being enabled. However, it may be necessary to add a rewrite rule to keep the http://sitemirror.com/news path from displaying a 404 or the folder's index, because both news.html and the news folder exist. This should not be tough.
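To see which paths might need such a rewrite, one option is to list every page that has a folder of the same name next to it. This is a sketch only; find_shadowed_pages is a hypothetical helper, and it takes the root of the mirror as its argument.

```shell
# Hypothetical helper: print each path that exists both as "X.html" and as
# a directory "X" inside the mirror, e.g. news.html next to news/.
find_shadowed_pages() {
    find "$1" -type f -name '*.html' | while IFS= read -r page; do
        dir=${page%.html}          # news.html -> news
        if [ -d "$dir" ]; then
            printf '%s\n' "$dir"
        fi
    done
}

# Usage, assuming the mirror was downloaded into the current directory:
# find_shadowed_pages .
```

Each path it prints is one that the server could resolve to the folder instead of the page.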
Oh, so here's my final wget command:
wget -p -r -l4 -E -k -nH http://example.com
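For readability, the same command can be written with long option names. The wrapper below is just a sketch; mirror_site and the DRY_RUN variable are made up here so that the command can be printed without hitting the network.

```shell
# Hypothetical wrapper around the final command, using long option names.
# With DRY_RUN=1 it prints the command instead of running it.
mirror_site() {
    # --page-requisites        (-p)  also fetch images, CSS, and so on
    # --recursive --level=4    (-r -l4)  follow links up to four levels deep
    # --adjust-extension       (-E)  append .html to HTML pages
    # --convert-links          (-k)  rewrite links to point at the local files
    # --no-host-directories    (-nH) don't create an example.com/ folder
    set -- wget --page-requisites --recursive --level=4 \
        --adjust-extension --convert-links --no-host-directories "$1"
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

DRY_RUN=1 mirror_site http://example.com
```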