
wget appends query string to resulting file


I'm trying to retrieve working webpages with wget and this goes well for most sites with the following command:

wget -p -k http://www.example.com 

In these cases I will end up with index.html and the needed CSS/JS etc.

However, in certain situations the URL has a query string, and in those cases I get an index.html with the query string appended to its name.

Example

www.onlinetechvision.com/?p=566 

Combined with the above wget command will result in:

index.html?p=566

I have tried using the --restrict-file-names=windows option, but that only gets me to

index.html@p=566

Can anyone explain why this is needed and how I can end up with a regular index.html file?

UPDATE: I'm sort of on the fence on taking a different approach. I found out I can take the first filename that wget saves by parsing the output. So the name that appears after Saving to: is the one I need.

However, the name is wrapped in this strange character: â. Rather than just stripping it with a hardcoded rule, I would like to know where it comes from.
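
A note on the Saving to: approach: the â is most likely not something wget hardcodes. In a UTF-8 locale wget wraps the file name in curly quotes (U+2018 and U+2019), and anything that reads those three-byte UTF-8 sequences as Latin-1 shows the first byte as â. Below is a minimal sketch of extracting the first saved name, assuming the log was written to a file with wget's -o option. first_saved_name is a hypothetical helper name, and the ASCII quote handling is an assumption about what wget prints in non-UTF-8 locales.

```shell
#!/bin/sh
# Print the first file name that follows "Saving to: " in a wget log
# read from stdin, with the surrounding quote characters removed.
# first_saved_name is a hypothetical helper, not part of wget.
first_saved_name() {
    lq=$(printf '\342\200\230')   # UTF-8 bytes of U+2018 (left curly quote)
    rq=$(printf '\342\200\231')   # UTF-8 bytes of U+2019 (right curly quote)
    # LC_ALL=C makes sed compare raw bytes, so the match does not depend
    # on the current locale.
    LC_ALL=C sed -n "/^Saving to: /{
        s/^Saving to: //
        s/^$lq//
        s/$rq\$//
        s/^[\`']//
        s/'\$//
        p
        q
    }"
}
```

Usage could look like: run wget -p -k -o wget.log with the URL, then first_saved_name < wget.log. Reading from a log file rather than a pipe avoids cutting wget off when sed stops after the first match.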

user1914292 asked Nov 08 '13 17:11


2 Answers

If you add the --adjust-extension option:

wget -p -k --adjust-extension www.onlinetechvision.com/?p=566

you get closer. The www.onlinetechvision.com folder will contain a file with a corrected extension: index.html@p=566.html, or index.html?p=566.html on *nix systems. From there it is simple to rename that file to index.html, even from a script.
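
The rename step could be scripted along these lines. This is a minimal sketch for a POSIX shell, assuming a single page was downloaded into the folder; normalize_index is a hypothetical helper name, and it refuses to overwrite an index.html that already exists.

```shell
#!/bin/sh
# Rename the page wget saved with a query-string suffix to plain index.html.
# normalize_index is a hypothetical helper; $1 is the folder wget created.
normalize_index() {
    dir=$1
    # Match names such as index.html?p=566.html or index.html@p=566.html.
    for f in "$dir"/index.html[?@]*; do
        # If the glob matched nothing, $f is the literal pattern; skip it.
        [ -e "$f" ] || continue
        # Do not clobber an existing index.html.
        if [ -e "$dir/index.html" ]; then
            echo "refusing to overwrite $dir/index.html" >&2
            return 1
        fi
        mv -- "$f" "$dir/index.html"
    done
}
```

For the example above, the call would be normalize_index www.onlinetechvision.com, run from the directory where wget was started.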

If you are on a Microsoft OS, make sure you have a recent version of wget. Builds are also available here: https://eternallybored.org/misc/wget/

TadejP answered Sep 18 '22 09:09


To answer your question about why this is needed, remember that the web server is likely to return different results based on the parameters in the query string. If a query for index.html?page=52 returns different results from index.html?page=53, you probably wouldn't want both pages to be saved in the same file.

Each HTTP request that uses a different set of query parameters is quite literally a request for a distinct resource. wget can't predict which of these changes is and isn't going to be significant, so it's doing the conservative thing and preserving the query parameter URLs in the filename of the local document.
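
To make the collision concrete, here is a simplified model of how a URL could map to a local file name. It is an illustration only, not wget's actual algorithm (which also deals with directories, character escaping and options such as --restrict-file-names); local_name is a hypothetical helper.

```shell
#!/bin/sh
# Simplified model: last path component (or index.html if empty),
# followed by "?" and the query string when one is present.
# local_name is a hypothetical helper, not how wget is implemented.
local_name() {
    url=$1
    rest=${url#*://}                 # drop the scheme, if any
    case $rest in
        */*) path=${rest#*/} ;;      # everything after the host
        *)   path= ;;
    esac
    query=
    case $path in
        *\?*) query=${path#*\?}; path=${path%%\?*} ;;
    esac
    base=${path##*/}                 # last path component
    [ -n "$base" ] || base=index.html
    if [ -n "$query" ]; then
        printf '%s?%s\n' "$base" "$query"
    else
        printf '%s\n' "$base"
    fi
}
```

Under this model, http://www.example.com/?page=52 and http://www.example.com/?page=53 give two different names, whereas dropping the query string would map both to index.html and the second download would overwrite the first.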

Tim Pierce answered Sep 22 '22 09:09