While HTML scraping is pretty well-documented from what I can see, and I understand the concept and implementation of it, what is the best method for scraping content that is tucked away behind authentication forms? I'm referring to content that I legitimately have access to, so what I'm looking for is a method for automatically submitting login data.
All I can think of is setting up a proxy, capturing the traffic from a manual login, then writing a script to replay that traffic as part of the HTML scraping run. As far as language goes, it would likely be done in Perl.
Has anyone had experience with this, or just a general thought?
Edit: This has been answered before, but with .NET. While it validates how I think it should be done, does anyone have a Perl script to do this?
Check out the Perl WWW::Mechanize library - it builds on LWP to provide tools for doing exactly the kind of interaction you refer to, and it can maintain state with cookies while you're about it!
WWW::Mechanize, or Mech for short, helps you automate interaction with a website. It supports performing a sequence of page fetches including following links and submitting forms. Each fetched page is parsed and its links and forms are extracted. A link or a form can be selected, form fields can be filled and the next page can be fetched. Mech also stores a history of the URLs you've visited, which can be queried and revisited.
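Here's a minimal sketch of a Mechanize login-and-scrape flow. The URL, form number, field names, and credentials are placeholders; you'd inspect the real login page (view source or your browser's dev tools) to find the actual form and field names.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize;

# autocheck => 1 makes Mech die on any failed request.
my $mech = WWW::Mechanize->new( autocheck => 1 );

# Fetch the login page; Mech stores cookies automatically.
$mech->get('https://example.com/login');

# Fill in and submit the login form.
# form_number and field names are placeholders -- match them to the real form.
$mech->submit_form(
    form_number => 1,
    fields      => {
        username => 'your_username',
        password => 'your_password',
    },
);

# The session cookie from the login is carried forward, so this
# request is made as the authenticated user.
$mech->get('https://example.com/members/report');
print $mech->content();
```

Because Mech keeps the cookie jar across requests, every `get()` after the successful `submit_form()` runs inside the authenticated session, so there's no need to capture and replay raw traffic yourself.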