
Using wget but ignore url parameters

I want to download the contents of a website whose URLs are built like this:

http://www.example.com/level1/level2?option1=1&option2=2

Only the http://www.example.com/level1/level2 part of the URL is unique for each page; the values of option1 and option2 keep changing. As a result, every unique page can appear under hundreds of different URLs. I am using wget to fetch all of the site's content, and because of this duplication I have already downloaded more than 3 GB of data. Is there a way to tell wget to ignore everything after the question mark in the URL? I can't find one in the man pages.

cootje asked Nov 04 '14 13:11


2 Answers

You can use --reject-regex to give wget a pattern for the URLs it should skip, e.g.

wget --reject-regex "(.*)\?(.*)" -m -c --content-disposition http://example.com/

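This mirrors the website but skips every URL that contains a question mark, which is useful for mirroring wiki sites. Note that matching URLs are not fetched at all; wget does not strip the query string and fetch the rest.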
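Before starting a long mirror run, it can help to preview which URLs the pattern would reject. The following is only a sketch: it uses grep -E as a stand-in for wget's POSIX regex matching, and the sample URLs are hypothetical (only the first form appears in the question).

```shell
# Hypothetical sample URLs; only the first form comes from the question.
urls='http://www.example.com/level1/level2?option1=1&option2=2
http://www.example.com/level1/level2
http://www.example.com/level1/other?option1=3'

# URLs matched by the pattern (wget --reject-regex would skip these).
rejected=$(printf '%s\n' "$urls" | grep -E '(.*)\?(.*)')

# URLs not matched by the pattern (wget would still fetch these).
kept=$(printf '%s\n' "$urls" | grep -Ev '(.*)\?(.*)')

printf 'rejected:\n%s\n\nkept:\n%s\n' "$rejected" "$kept"
```

If a page is only ever linked with parameters, it ends up in the rejected list and is never downloaded, so this approach fits best when the clean URL is also linked somewhere on the site.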

kenorb answered Sep 23 '22 20:09


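wget2 has this built in via two options: --cut-url-get-vars removes the GET variables (everything from the question mark on) from URLs, and --cut-file-get-vars removes them from the saved file names.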
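To illustrate the effect of cutting GET variables, here is a rough sketch that strips the query string from a list of URLs with sed and removes the resulting duplicates. This is not how wget2 implements the options, only an approximation of the outcome; the sample URLs are hypothetical.

```shell
# Hypothetical URL variants that all point at the same two pages.
urls='http://www.example.com/level1/level2?option1=1&option2=2
http://www.example.com/level1/level2?option1=2&option2=1
http://www.example.com/level1/level2
http://www.example.com/level1/other?option1=1'

# Drop everything from the first "?" onward, then keep one copy of each URL.
unique=$(printf '%s\n' "$urls" | sed 's/?.*$//' | sort -u)

printf '%s\n' "$unique"
```

The four variants collapse to two distinct pages. With wget2 installed, the corresponding mirror call would presumably look like `wget2 --cut-url-get-vars --cut-file-get-vars -m http://example.com/`; treat that exact combination as an untested assumption.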

Jan Joneš answered Sep 21 '22 20:09