 

Perl: HTML Scraping from an Authenticated website

While HTML scraping is pretty well-documented from what I can see, and I understand the concept and implementation of it, what is the best method for scraping content that is tucked away behind authentication forms? I'm referring to content that I legitimately have access to, so what I'm looking for is a method for automatically submitting login data.

All I can think of is setting up a proxy, capturing the requests from a manual login, then writing a script to replay those requests as part of the scraping run. As far as language goes, it would likely be done in Perl.

Has anyone had experience with this, or just a general thought?

Edit: This has been answered before, but with .NET. While that validates how I think it should be done, does anyone have a Perl script to do this?

asked Jan 25 '23 by IL.


1 Answer

Check out the Perl WWW::Mechanize library - it builds on LWP to provide tools for doing exactly the kind of interaction you refer to, and it can maintain state with cookies while you're about it!

WWW::Mechanize, or Mech for short, helps you automate interaction with a website. It supports performing a sequence of page fetches including following links and submitting forms. Each fetched page is parsed and its links and forms are extracted. A link or a form can be selected, form fields can be filled and the next page can be fetched. Mech also stores a history of the URLs you've visited, which can be queried and revisited.
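As a minimal sketch of that workflow: fetch the login page, submit the login form, then request a protected page with the session cookie Mech carries for you. The URLs and form field names below are placeholders; you'd inspect the site's actual login form to find the real ones.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize;

# autocheck => 1 makes Mech die on any HTTP error, so failures are loud.
my $mech = WWW::Mechanize->new( autocheck => 1 );

# Fetch the login page. Mech maintains a cookie jar automatically,
# so the session cookie set here persists across later requests.
$mech->get('https://example.com/login');

# Fill in and submit the first form on the page.
# (form_number and the field names are assumptions -- check the site's HTML.)
$mech->submit_form(
    form_number => 1,
    fields      => {
        username => 'your_username',
        password => 'your_password',
    },
);

# Now authenticated: fetch a protected page and scrape it.
$mech->get('https://example.com/members/report');
my $html = $mech->content;

# For example, list every link Mech extracted from the page.
for my $link ( $mech->links ) {
    print $link->url, "\n";
}
```

If the login form has a name or id, `form_name`/`form_id` is more robust than `form_number`; and for sites that set cookies via JavaScript, Mech alone won't be enough.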

answered Feb 03 '23 by Paul Dixon