
What do I gain by filtering URLs through Perl's URI module?

Tags: url, uri, perl

Do I gain something when I transform my $url like this: $url = URI->new( $url )?

#!/usr/bin/env perl
use warnings; use strict;
use 5.012;
use URI;
use XML::LibXML;

my $url = 'http://stackoverflow.com/';
$url = URI->new( $url );

my $doc = XML::LibXML->load_html( location => $url, recover => 2 );
my @nodes = $doc->getElementsByTagName( 'a' );
say scalar @nodes;
sid_com asked Apr 24 '10 17:04


3 Answers

The URI module's constructor cleans up the URI for you. For example, it correctly escapes characters that are not valid in a URI (see URI::Escape).
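As a minimal sketch of that escaping (the example.com URL is made up for illustration and is not from the question), passing a string that contains spaces to the constructor should give back a URI in which they are percent-encoded:

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use URI;

# Hypothetical input containing spaces, which are not legal in a URI.
my $raw = 'http://example.com/some path/?q=a b';

# The constructor escapes characters outside the allowed URI set.
my $uri = URI->new($raw);

# Stringifying the object yields the escaped form.
print "$uri\n";    # http://example.com/some%20path/?q=a%20b
```

Note that this only escapes characters that cannot appear in a URI at all; it does not guess whether an existing `%` or `&` was meant literally.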

DVK answered Nov 13 '22 17:11


The URI module has several benefits:

  • It normalizes the URL for you
  • It can resolve relative URLs
  • It can detect invalid URLs (although you need to turn off the schemeless bits)
  • You can easily filter the URLs that you want to process.

The benefit you get in the small piece of code you show is minimal, but as you continue to work on the problem, perhaps spidering the site, URI becomes more useful for deciding what to do next.
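A rough sketch of how the first, second, and fourth points might look in a spidering context follows. The base URL, the list of hrefs, and the "same site" rule are all assumptions made up for this example:

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use 5.010;
use URI;

# Hypothetical page address and the href values found on it.
my $base  = 'http://example.com/docs/guide/index.html';
my @hrefs = (
    '../about.html',
    'HTTP://Example.COM:80/index.html',
    'mailto:someone@example.com',
    'http://other.example.org/',
);

# Resolve each href against the base, then normalize it
# (lowercase scheme and host, default port dropped).
my @absolute = map { URI->new_abs( $_, $base )->canonical } @hrefs;

# Keep only http/https links on the same host. The scheme test runs
# first, so host() is never called on a mailto: URI, which lacks it.
my @same_site = grep {
    ( $_->scheme // '' ) =~ /^https?\z/ && $_->host eq 'example.com'
} @absolute;

say for @same_site;
```

With these inputs the script should print `http://example.com/docs/about.html` and `http://example.com/index.html`, and drop the mailto: link and the link to the other host.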

brian d foy answered Nov 13 '22 17:11


I'm surprised nobody has mentioned it yet, but $url = URI->new( $url ); doesn't clean up your $url and hand it back to you as a string. It creates a new object of class URI (or, rather, of one of its subclasses), which can then be passed to other code that requires a URI object. That's not particularly important in this case, since XML::LibXML appears to be happy to accept locations as either strings or objects, but some other modules require you to give them a URI object and will reject URLs presented as plain strings.
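To illustrate the string-versus-object distinction, here is a small sketch. The path and query string are added for the example and are not part of the question's code:

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use 5.010;
use URI;

my $url = 'http://stackoverflow.com/questions?sort=votes';
my $uri = URI->new($url);

# $url is still a plain string; $uri is a blessed object.
say ref($url) ? 'object' : 'plain string';
say ref($uri);     # a scheme-specific subclass such as URI::http

# The object exposes accessors that a plain string does not have.
say $uri->scheme;
say $uri->host;
say $uri->path;

# It stringifies when interpolated, which is why code that only
# expects a string usually keeps working.
say "$uri";
```

Code that insists on an object can check `$uri->isa('URI')`, while code that wants a plain string can call `$uri->as_string`.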

Dave Sherohman answered Nov 13 '22 16:11