wget -k converts files differently on Windows and Linux

Tags: apache, wget
I have GNU Wget 1.10.2 on both Windows and Linux, and the -k option behaves differently on the two.

-k, --convert-links make links in downloaded HTML point to local files.

On Windows it produces:

www.example.com/index.html
www.example.com/index.html@page=about
www.example.com/index.html@page=contact
www.example.com/index.html@page=sitemap

and on Linux it produces:

www.example.com/index.html
www.example.com/index.html?page=about
www.example.com/index.html?page=contact
www.example.com/index.html?page=sitemap

This is a problem on Linux: when I serve the mirror through Apache, it cannot distinguish between the 4 generated pages, because everything after the question mark (?) is treated as the query string for index.html.

Any ideas on how I can control this?

Thanks

cherouvim asked Mar 10 '09 11:03



2 Answers

You can't use a question mark (?) in a filename on NTFS or FAT32. This is why wget uses the at symbol (@) instead.

On Linux, most filesystems forbid only the slash (/) and the NUL byte in file names, so wget keeps the question mark (since it's part of the URI).

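You can force either behaviour by using --restrict-file-names=unix or --restrict-file-names=windows. For a mirror served by Apache on Linux, the windows mode gives you the @ file names, which contain no query-string separator.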
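As a rough illustration of how the option might be passed, here is a minimal Python sketch that builds the wget command line. The helper name build_wget_command is hypothetical, and --mirror is an assumption about how the site is being downloaded; only --convert-links and --restrict-file-names come from the discussion above.

```python
def build_wget_command(url, mode="windows"):
    """Return a wget argument list that fetches `url` with a given file-name mode.

    `mode` is limited here to the two values named in the answer above.
    """
    if mode not in ("unix", "windows"):
        raise ValueError("mode must be 'unix' or 'windows'")
    return [
        "wget",
        "--mirror",                        # assumption: a recursive mirror is wanted
        "--convert-links",                 # the -k option from the question
        f"--restrict-file-names={mode}",   # picks '?' (unix) or '@' (windows) names
        url,
    ]


# Print the command instead of running it, so the sketch needs no network access.
print(" ".join(build_wget_command("http://www.example.com/")))
```

The returned list can be handed to subprocess.run(cmd, check=True) on a machine that has wget on its PATH.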

From the wget documentation:

When mode is set to “unix”, Wget escapes the character ‘/’ and the control characters in the ranges 0–31 and 128–159. This is the default on Unix-like OS'es.

When mode is set to “windows”, Wget escapes the characters ‘\’, ‘|’, ‘/’, ‘:’, ‘?’, ‘"’, ‘*’, ‘<’, ‘>’, and the control characters in the ranges 0–31 and 128–159. In addition to this, Wget in Windows mode uses ‘+’ instead of ‘:’ to separate host and port in local file names, and uses ‘@’ instead of ‘?’ to separate the query portion of the file name from the rest. Therefore, a URL that would be saved as ‘www.xemacs.org:4300/search.pl?input=blah’ in Unix mode would be saved as ‘www.xemacs.org+4300/search.pl@input=blah’ in Windows mode. This mode is the default on Windows.
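To make the difference concrete, the sketch below imitates only the two separator changes the documentation describes for Windows mode ('+' for the host/port colon, '@' for the query separator). It is not wget's implementation and does not cover the other escaped characters; the function name windows_mode_name is made up for this example.

```python
def windows_mode_name(unix_name):
    """Map a Unix-mode local file name to its Windows-mode counterpart.

    Only the two separator substitutions are modelled; other characters
    that wget escapes in Windows mode are left untouched.
    """
    # The first path component is the host, optionally followed by :port.
    host, slash, rest = unix_name.partition("/")
    host = host.replace(":", "+", 1)
    # Only the first '?' separates the query portion from the rest.
    rest = rest.replace("?", "@", 1)
    return host + slash + rest


print(windows_mode_name("www.xemacs.org:4300/search.pl?input=blah"))
# prints: www.xemacs.org+4300/search.pl@input=blah
```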

Can Berk Güder answered Sep 19 '22 06:09



This is a problem on Linux: when I serve the mirror through Apache, it cannot distinguish between the 4 generated pages, because everything after the question mark (?) is treated as the query string for index.html.

To include a literal question mark in the path part of a URL, you can percent-encode it as %3F:

www.example.com/index.html%3Fpage=about

I'd think --convert-links should be doing this for you; it may be a bug if it isn't.
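The escaping described above can be sketched with Python's standard urllib.parse.quote. The helper name escape_local_link is hypothetical, and keeping '/' and '=' unescaped is a choice made here so the output matches the example line in this answer.

```python
from urllib.parse import quote


def escape_local_link(path):
    """Percent-encode a local file name so it can be used as a URL path.

    '/' and '=' are left as-is; '?' becomes %3F, so the server looks for a
    file whose name contains the question mark rather than parsing a query.
    """
    return quote(path, safe="/=")


print(escape_local_link("www.example.com/index.html?page=about"))
# prints: www.example.com/index.html%3Fpage=about
```

A link written this way asks Apache for the file named index.html?page=about on disk, which is what a Unix-mode mirror contains.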

bobince answered Sep 18 '22 06:09
