Normally, pages behind a login form can be downloaded with:
wget --no-check-certificate --save-cookies cookies --keep-session-cookies \
--post-data="username=example&password=example" \
"https://example.com/index.php?title=Special:Userlogin&returntotitle="
wget --no-check-certificate --load-cookies=cookies \
--no-parent -r --level=2 -nc -E \
https://example.com/Special:Sitemap
But on DekiWiki sites this doesn't work when a login is required.
The problem seems to be the behavior described in man wget:
Note: if Wget is redirected after the POST request is completed, it will not send the POST data to the redirected URL. This is because URLs that process POST often respond with a redirection to a regular page, which does not desire or accept POST. It is not completely clear that this behavior is optimal; if it doesn't work out, it might be changed in the future.
Question
Can this be done using Perl, e.g. with HTML::TreeBuilder,
or HTML::TokeParser,
or WWW::Mechanize,
or any other Perl module?
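For what it's worth, WWW::Mechanize handles exactly the case the wget manual warns about: it follows the redirect that a login POST returns, and its built-in cookie jar carries the session cookies across requests. A minimal sketch — the field names username/password and the URLs are copied from the wget example above and may need adjusting for a real DekiWiki login form:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize;

my $mech = WWW::Mechanize->new(
    # Counterpart of wget's --no-check-certificate.
    ssl_opts => { verify_hostname => 0 },
);

# Unlike wget, Mechanize follows redirects after a POST by default
# (POST is in its requests_redirectable list), and the session
# cookies set during the redirect chain are kept automatically.
$mech->get('https://example.com/index.php?title=Special:Userlogin&returntotitle=');
$mech->submit_form(
    with_fields => {
        username => 'example',
        password => 'example',
    },
);

# Now fetch the page that previously required the login.
$mech->get('https://example.com/Special:Sitemap');
print $mech->content;
```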
Some sites that require a login do not send the cookie back with the response to the POST itself.
Instead they send a redirect response (302), which most browsers follow automatically, and the cookie arrives during that redirected exchange.
I use curl for this by enabling CURLOPT_FOLLOWLOCATION; for the command-line tool the option is --location (or -L). curl is a free tool like wget.
curl --cookie cookie.txt --cookie-jar cookie.txt \
--data-urlencode "username=example" \
--data-urlencode "password=example" \
--insecure --location \
"https://example.com/index.php?title=Special:Userlogin&returntotitle="
curl --cookie cookie.txt --cookie-jar cookie.txt \
--insecure --location \
-o downloadedfile.html "https://example.com/Special:Sitemap"
(The URLs must be quoted, or the shell treats the & as a background operator. Each form field also needs its own --data-urlencode flag, since a single flag would percent-encode the & between them. The login and the download are split into two invocations so the POST data is sent only to the login URL.)
http://curl.haxx.se/download.html
Also, sometimes a login form expects a multipart/form-data POST instead of an application/x-www-form-urlencoded POST. To make curl send multipart/form-data, replace each --data-urlencode with -F, e.g. -F "username=example" -F "password=example".