I have GNU Wget 1.10.2 on both Windows and Linux, and the -k option behaves differently on the two.
-k, --convert-links make links in downloaded HTML point to local files.
On Windows it produces:
www.example.com/index.html
www.example.com/index.html@page=about
www.example.com/index.html@page=contact
www.example.com/index.html@page=sitemap
and on Linux it produces:
www.example.com/index.html
www.example.com/index.html?page=about
www.example.com/index.html?page=contact
www.example.com/index.html?page=sitemap
This is a problem on Linux because when I serve the mirror through Apache, it does not distinguish between the 4 generated pages: the part after the question mark (?) is treated as the query string rather than as part of the file name.
Any ideas on how I can control this?
thanks
You can't use a question mark (?) in a filename on NTFS or FAT32. This is why wget uses the at symbol (@) instead.
In Linux, only a slash (/) is forbidden on most filesystems, so wget uses the question mark (since it's part of the URI).
You can force either behaviour by using --restrict-file-names=unix or --restrict-file-names=windows.
From the wget documentation:
When mode is set to “unix”, Wget escapes the character ‘/’ and the control characters in the ranges 0–31 and 128–159. This is the default on Unix-like OS'es.
When mode is set to “windows”, Wget escapes the characters ‘\’, ‘|’, ‘/’, ‘:’, ‘?’, ‘"’, ‘*’, ‘<’, ‘>’, and the control characters in the ranges 0–31 and 128–159. In addition to this, Wget in Windows mode uses ‘+’ instead of ‘:’ to separate host and port in local file names, and uses ‘@’ instead of ‘?’ to separate the query portion of the file name from the rest. Therefore, a URL that would be saved as ‘www.xemacs.org:4300/search.pl?input=blah’ in Unix mode would be saved as ‘www.xemacs.org+4300/search.pl@input=blah’ in Windows mode. This mode is the default on Windows.
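To make the quoted rule concrete, here is a rough Python sketch of just the two separator substitutions it describes. It is a simplified model, not wget's actual implementation: the function name local_name is made up for illustration, and the other character escaping mentioned in the documentation is deliberately ignored.

```python
def local_name(url_without_scheme: str, mode: str = "unix") -> str:
    """Model the host/port and query separators wget uses in local file names.

    Hypothetical helper for illustration only. It covers the '+' for ':' and
    '@' for '?' substitutions from the quoted documentation and nothing else.
    """
    if mode == "unix":
        # Unix mode keeps ':' and '?' as they are.
        return url_without_scheme
    if mode == "windows":
        # Split "host[:port]" from the rest of the path at the first slash.
        host, slash, rest = url_without_scheme.partition("/")
        host = host.replace(":", "+")      # host:port -> host+port
        rest = rest.replace("?", "@", 1)   # first '?' starts the query part
        return host + slash + rest
    raise ValueError(f"unknown mode: {mode!r}")


print(local_name("www.xemacs.org:4300/search.pl?input=blah", "unix"))
print(local_name("www.xemacs.org:4300/search.pl?input=blah", "windows"))
```

With the documentation's own example, the second call yields the Windows-mode name quoted above, which is the form that avoids a literal question mark in the mirrored file names.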
This is a problem on Linux because when I serve the mirror through Apache, it does not distinguish between the 4 generated pages: the part after the question mark (?) is treated as the query string rather than as part of the file name.
To include a question mark in a URL path part, you can escape it:
www.example.com/index.html%3Fpage=about
I'd expect --convert-links to do this for you; if it doesn't, that may be a bug.
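If you end up fixing the links yourself, the escaping can be sketched in Python with the standard library's urllib.parse.quote. This is only an illustration of percent-encoding a file name for use as a URL path; the helper name escape_local_link is hypothetical, and it assumes the file on disk really is named with a literal question mark.

```python
from urllib.parse import quote


def escape_local_link(filename: str) -> str:
    """Percent-encode a local file name so '?' is read as part of the path.

    '/' and '=' are left alone (they are listed as safe), so only characters
    such as '?' are rewritten, here to '%3F'.
    """
    return quote(filename, safe="/=")


for name in ("index.html?page=about", "index.html?page=contact"):
    print(escape_local_link(name))
```

A link written as index.html%3Fpage=about asks the server for a file whose name contains the question mark, instead of requesting index.html with a query string.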