What's the best way to write a maintainable web scraping app?

I wrote a Perl script a while ago which logged into my online banking site and emailed me my balance and a mini-statement every day. I found it very useful for keeping track of my finances. The only problem is that I wrote it using just Perl and curl, and it was quite complicated and hard to maintain. After a few instances of my bank changing their web page, I got fed up with debugging it to keep it up to date.

So what's the best way of writing such a program so that it's easy to maintain? I'd like to write a nice, well-engineered version in either Perl or Java which will be easy to update when the bank inevitably fiddles with its web site.

asked Nov 09 '09 by Benj

3 Answers

In Perl, something like WWW::Mechanize can already make your script simpler and more robust, because it can find the HTML forms in previous responses from the website. You can fill in these forms to prepare a new request. For example:

use strict;
use warnings;
use WWW::Mechanize;

my $mech = WWW::Mechanize->new();
$mech->get($url);

# Fill in and submit the first form on the page (e.g. the login form).
$mech->submit_form(
    form_number => 1,
    fields      => { password => $password },
);
die "Login failed: ", $mech->status unless $mech->success;
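Because Mechanize extracts the form definition from the page itself, hidden fields and session tokens are carried over automatically; only the fields you explicitly override (here, password) have to match the bank's markup, which is exactly what makes the script easier to keep up to date.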
answered by Bruno De Fraine

The combination of WWW::Mechanize and Web::Scraper is what makes me most productive. There's a nice article about that combination at catalyzed.org.
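As a minimal sketch of how the two fit together, Mechanize handles the login and navigation while Web::Scraper declares what to extract. The URL and the CSS selectors (table.statement, td.date, td.amount) are placeholders for whatever markup your bank actually uses:

use strict;
use warnings;
use WWW::Mechanize;
use Web::Scraper;

my $mech = WWW::Mechanize->new();
$mech->get('https://bank.example.com/statement');   # hypothetical URL

# Declare *what* to extract, not *how* to walk the HTML.
# The selectors below are placeholders for the real page structure.
my $statement = scraper {
    process 'table.statement tr', 'rows[]' => scraper {
        process 'td.date',   date   => 'TEXT';
        process 'td.amount', amount => 'TEXT';
    };
};

my $result = $statement->scrape( $mech->content );
printf "%s  %s\n", $_->{date}, $_->{amount} for @{ $result->{rows} };

When the bank reshuffles its HTML, usually only the selectors need updating, not the surrounding logic.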

answered by singingfish


If I were to give you one piece of advice, it would be to use XPath for all your scraping needs. Avoid regexes.
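For instance, a small sketch using HTML::TreeBuilder::XPath; the expression //span[@id="balance"] is an assumption standing in for wherever the bank actually puts the figure:

use strict;
use warnings;
use HTML::TreeBuilder::XPath;

my $tree = HTML::TreeBuilder::XPath->new_from_content($html);

# One structural query instead of a brittle regex over raw HTML;
# the id "balance" is a placeholder for the real page's markup.
my $balance = $tree->findvalue('//span[@id="balance"]');
print "Balance: $balance\n";

$tree->delete;    # free the parse tree

An XPath query keyed to the document structure tends to survive cosmetic page changes that would silently break a regex.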

answered by Geo