
Which Perl modules are good for data munging?

Nine years ago, when I started parsing HTML and free text with Perl, I read the classic Data Munging with Perl. Does anyone know whether David is planning to update the book, or whether there are similar books or web pages that explain newer parsing modules such as XML-Twig and Regexp-Grammars?

I assume that over the last nine years some modules have remained as good as they were, some have been kept up to date and gained interesting new methods, and some now have better replacements. For example, is Parse-RecDescent still the only option for free text parsing, or will the Perl 6 influenced Regexp-Grammars replace it in many scenarios?

I have gone four years without doing any active HTML, XML or free text data mining with Perl, so my toolkit in this area is probably a bit outdated. Any feedback on HTML and DOM manipulation, link extraction/verification, web testing (like Mechanize), XML manipulation and free text parsing from people who are up to date with the current CPAN modules in this area would be more than welcome.

Some new additions to my toolkit:

  • jQuery
  • pQuery
  • HTML-SimpleLinkExtor
  • XML::Twig
  • HTML-Table

Still in my toolkit:

  • HTML-TableExtract # not updated since 2006
  • WWW-Mechanize
  • Parse-RecDescent
  • HTML-TokeParser
  • URI-Escape
  • [more...]
Pablo Marin-Garcia asked Sep 27 '10 00:09



2 Answers

It's unlikely that there will ever be a second edition of "Data Munging with Perl". I'm afraid that the economics just don't stack up.

But you're right that technology has moved on a long way since 2001, and there are plenty of new and improved modules that cover much of the same area as the modules discussed in the book. For example, I can't remember the last time I used XML::Parser or XML::DOM. I seem to use XML::LibXML for the majority of my XML work these days. Also, of course, my discussion of databases is incomplete because it doesn't mention DBIx::Class.
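
To give a rough idea of the XML::LibXML style mentioned above, here is a minimal sketch. The sample document, its element names and the book_summaries helper are all invented for illustration; none of them come from the book.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical sample data, made up for this sketch.
my $xml = <<'END_XML';
<books>
  <book year="2001">
    <title>Data Munging with Perl</title>
    <author>Dave Cross</author>
  </book>
  <book year="2005">
    <title>Perl Best Practices</title>
    <author>Damian Conway</author>
  </book>
</books>
END_XML

# Return a "title (year)" string for every <book> element in an XML string.
sub book_summaries {
    my ($xml_string) = @_;

    # Parse the string into a DOM document.
    my $doc = XML::LibXML->load_xml(string => $xml_string);

    my @summaries;
    for my $book ($doc->findnodes('/books/book')) {
        my $title = $book->findvalue('./title');   # XPath relative to this <book>
        my $year  = $book->getAttribute('year');
        push @summaries, "$title ($year)";
    }
    return @summaries;
}

print "$_\n" for book_summaries($xml);
```

XPath does most of the work here, which is a large part of why it feels lighter than walking the tree by hand with XML::DOM.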

Perhaps it would be an interesting idea to update some of this information through some posts on my Perl blog. I'll give it some thought. Thanks for the idea.

Dave Cross answered Oct 30 '22 00:10



re: Parse::RecDescent <=> Regexp::Grammars

Damian Conway has been quoted as saying that Regexp::Grammars is the successor to Parse::RecDescent. Even so, if Parse::RecDescent still gets the job done for you, then continue to use it. The tool you know well is better than the tool you don't know!

However, if performance is a key issue and you are running perl 5.10+, then do consider Regexp::Grammars.

Hope Dave doesn't mind, but here is his first Parse::RecDescent example from Data Munging with Perl (11.1.1) converted to Regexp::Grammars:

use 5.010;
use warnings;
use Regexp::Grammars;

my $parser = qr{
    <Sentence>

    <rule: Sentence>        <subject> <verb> <object>
    <rule: subject>         <noun_phrase>
    <rule: object>          <noun_phrase>
    <rule: noun_phrase>     <pronoun> | <proper_noun> | <article> <noun>

    <token: verb>           wrote | likes | ate
    <token: article>        a | the | this
    <token: pronoun>        it | he
    <token: proper_noun>    Perl | Dave | Larry
    <token: noun>           book | cat
}xms;

while (<DATA>) {
    chomp;
    print "'$_' is ";
    print 'NOT ' unless $_ =~ $parser;
    say 'a valid sentence';
}

__DATA__
Larry wrote Perl
Larry wrote a book
Dave likes Perl
Dave likes the book
Dave wrote this book
the cat ate the book
Dave got very angry

NB. For those who don't have the book: only "Dave got very angry" is an invalid sentence :)
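
A successful Regexp::Grammars match also leaves its results in the %/ hash, so the same grammar can hand back a parse tree rather than just a yes/no answer. A minimal sketch, reusing the grammar above; parse_sentence is a hypothetical helper name, and the key layout reflects my understanding of how the module nests subrule results under their rule names:

```perl
use 5.010;
use strict;
use warnings;
use Regexp::Grammars;

# Same grammar as in the example above.
my $parser = qr{
    <Sentence>

    <rule: Sentence>        <subject> <verb> <object>
    <rule: subject>         <noun_phrase>
    <rule: object>          <noun_phrase>
    <rule: noun_phrase>     <pronoun> | <proper_noun> | <article> <noun>

    <token: verb>           wrote | likes | ate
    <token: article>        a | the | this
    <token: pronoun>        it | he
    <token: proper_noun>    Perl | Dave | Larry
    <token: noun>           book | cat
}xms;

# Hypothetical helper: return a copy of the result hash, or undef on failure.
sub parse_sentence {
    my ($text) = @_;
    return unless $text =~ $parser;

    # %/ is filled in by Regexp::Grammars after a successful match;
    # copy it before another match can overwrite it.
    my %result = %/;
    return \%result;
}

my $tree = parse_sentence('Larry wrote a book');

# Each subrule's result is stored under the subrule's name.
say "verb:    $tree->{Sentence}{verb}";
say "subject: $tree->{Sentence}{subject}{noun_phrase}{proper_noun}";
say "object:  $tree->{Sentence}{object}{noun_phrase}{noun}";
```

Dumping the returned hash with Data::Dumper is the easiest way to see the full structure for your own grammar.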

/I3az/

draegtun answered Oct 30 '22 01:10
