Using wget while ignoring URL parameters

Question

I want to download the contents of a website where the URLs are built as

http://www.example.com/level1/level2?option1=1&option2=2

Within the URL, only http://www.example.com/level1/level2 is unique for each page; the values of option1 and option2 keep changing. In fact, every unique page can appear under hundreds of different URLs because of these variables. I am using wget to fetch all of the site's content, and because of this problem I have already downloaded more than 3 GB of data. Is there a way to tell wget to ignore everything after the question mark in the URL? I can't find it in the man pages.

kenorb · Accepted Answer

You can use --reject-regex to specify a pattern for the URLs that wget should reject, e.g.

wget --reject-regex "(.*)\?(.*)" -m -c --content-disposition http://example.com/

This mirrors the website but skips every URL that contains a question mark, which is useful for mirroring wiki sites. Note that such URLs are skipped entirely; they are not fetched with the query string removed.
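Before starting a long mirror, it can help to check which URLs the pattern would catch. The sketch below is only an approximation: it uses grep -E as a stand-in for wget's regex matching (assuming wget's default POSIX regex type), and would_reject is a hypothetical helper name, not part of wget.

```shell
#!/bin/sh
# Same pattern as passed to --reject-regex in the answer above.
pattern='(.*)\?(.*)'

# would_reject (hypothetical helper): prints "reject" if the URL matches
# the pattern, "keep" otherwise. grep -E only approximates wget's matcher.
would_reject() {
    if printf '%s\n' "$1" | grep -Eq "$pattern"; then
        echo reject
    else
        echo keep
    fi
}

# Try the pattern on the two URL shapes from the question.
for url in \
    'http://www.example.com/level1/level2' \
    'http://www.example.com/level1/level2?option1=1&option2=2'
do
    printf '%s  %s\n' "$(would_reject "$url")" "$url"
done
```

The plain URL should come out as "keep" and the one with a query string as "reject", which matches the behavior described above: pages are still downloaded as long as the site links to them without parameters somewhere.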

Jan Joneš · Answer

wget2 has this built in via the options --cut-url-get-vars and --cut-file-get-vars, which strip the query string from URLs and from saved file names, respectively.
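As a rough illustration of how the two options might be combined for a mirror, here is a minimal sketch. The wrapper name mirror_clean and the DRY_RUN switch are assumptions made for this example; only the two --cut-* options come from the answer, and --mirror is added on the assumption that a full-site mirror is wanted.

```shell
#!/bin/sh
# mirror_clean (hypothetical wrapper): mirrors a site with wget2 while
# dropping query strings from URLs and from saved file names.
mirror_clean() {
    # ${DRY_RUN:+echo} expands to "echo" when DRY_RUN is set and non-empty,
    # so the command is printed instead of executed.
    ${DRY_RUN:+echo} wget2 --mirror --cut-url-get-vars --cut-file-get-vars "$1"
}

# Print the command rather than start a download.
DRY_RUN=1 mirror_clean 'http://www.example.com/'
```

Calling mirror_clean without DRY_RUN set would run the actual download, so it is worth checking the printed command first.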
