 

What is the easiest way to programmatically extract structured data from a bunch of web pages?


I am currently using an Adobe AIR program I have written to follow the links on one page and grab a section of data off of the subsequent pages. This actually works fine, and for programmers I think this (or other languages) provides a reasonable approach, written on a case-by-case basis. Maybe there is a specific language or library that lets a programmer do this very quickly, and if so I would be interested in knowing what it is.

Also, do any tools exist that would allow a non-programmer, like a customer support rep or someone in charge of data acquisition, to extract structured data from web pages without a lot of copying and pasting?

dennisjtaylor asked Dec 18 '09


1 Answer

If you search Stack Overflow for WWW::Mechanize and pQuery you will find many examples using these Perl CPAN modules.
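For the programmer case, a typical WWW::Mechanize session follows the links on an index page and then pulls something out of each target page. A minimal sketch along those lines (the URL and the link pattern below are made up for illustration):

use strict;
use warnings;
use WWW::Mechanize;

my $mech = WWW::Mechanize->new;

# Hypothetical listing page -- substitute the real starting URL.
$mech->get('http://example.com/listing');

# Follow every link whose URL matches a pattern and print the page title.
for my $link ( $mech->find_all_links( url_regex => qr/detail/ ) ) {
    $mech->get( $link->url_abs );
    print $mech->title, "\n";
    $mech->back;    # return to the listing page before the next link
}

You would replace the print with whatever extraction you need (regexes, pQuery selectors, and so on) against $mech->content.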

However, because you mentioned "non-programmer", perhaps the Web::Scraper CPAN module may be more appropriate? It's more DSL-like and so perhaps easier for a non-programmer to pick up.

Here is an example from the documentation for retrieving tweets from Twitter:

use URI;
use Web::Scraper;

# Each li.status element on the page becomes one entry in "tweets",
# with the body text, timestamp and permalink pulled out of it.
my $tweets = scraper {
    process "li.status", "tweets[]" => scraper {
        process ".entry-content",    body => 'TEXT';
        process ".entry-date",       when => 'TEXT';
        process 'a[rel="bookmark"]', link => '@href';
    };
};

my $res = $tweets->scrape( URI->new("http://twitter.com/miyagawa") );

for my $tweet (@{$res->{tweets}}) {
    print "$tweet->{body} $tweet->{when} (link: $tweet->{link})\n";
}

draegtun answered Sep 22 '22