Perl regex substitution for a URL

Tags: regex, url, perl

I am trying to use a complex regex to match URLs in a body of text. The aim is to delimit the URLs in the text.

I would like to do something like the below

perl -pe 's/regex/left $1 right/g;' inputfile

which will substitute all occurrences of the regex with the matched value surrounded by the words left and right. This is just a simplified example to illustrate the point: the real scenario has loads of -e expressions, and I am looking to add another for this particular matching purpose.

The regex is whatever matches a URL. I realise matching URLs is very difficult, and it is probably impossible to identify all possibilities, but a reasonable approximation would be fine. I have found one such approximation at http://daringfireball.net/2010/07/improved_regex_for_matching_urls.

I cannot, however, get that regex to work in a perl construct like the above. I have tried delimiters other than /, for example ~, but without success.

starfry asked Jan 14 '23 01:01


2 Answers

Appendix B of RFC 2396 gives a regex for parsing URIs.

B. Parsing a URI Reference with a Regular Expression

As described in Section 4.3, the generic URI syntax is not sufficient to disambiguate the components of some forms of URI. Since the “greedy algorithm” described in that section is identical to the disambiguation method used by POSIX regular expressions, it is natural and commonplace to use a regular expression for parsing the potential four components and fragment identifier of a URI reference.

The following line is the regular expression for breaking-down a URI reference into its components.

^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
 12            3  4          5       6  7        8 9

The numbers in the second line above are only to assist readability; they indicate the reference points for each subexpression (i.e., each paired parenthesis). We refer to the value matched for subexpression n as $<n>. For example, matching the above expression to

http://www.ics.uci.edu/pub/ietf/uri/#Related

results in the following subexpression matches:

$1 = http:
$2 = http
$3 = //www.ics.uci.edu
$4 = www.ics.uci.edu
$5 = /pub/ietf/uri/
$6 = <undefined>
$7 = <undefined>
$8 = #Related
$9 = Related

where <undefined> indicates that the component is not present, as is the case for the query component in the above example. Therefore, we can determine the value of the four components and fragment as

scheme    = $2
authority = $4
path      = $5
query     = $7
fragment  = $9

and, going in the opposite direction, we can recreate a URI reference from its components using the algorithm in step 7 of Section 5.2.

The regex is directly usable in Perl, as in

if ($uri =~ m!^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?!) {
    my($host,$path) = ($4,$5);
    print "$host => $path\n";
}

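Greedy quantifiers will likely make this pattern awkward to use with s///: it consumes as much text as possible and is likely to overrun URI boundaries that are not otherwise marked in running text. The pattern is also anchored with ^, so as written it can only match at the start of the string.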
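To illustrate the overrun, here is a minimal sketch. It assumes the leading ^ is dropped so the pattern can be tried anywhere in a line; the helper name mark_naive and the sample sentence are made up for this illustration and are not part of the RFC.

```perl
use strict;
use warnings;

# Appendix B pattern from RFC 2396, with the leading ^ removed
# (an assumption made only for this demonstration).
my $rfc = qr{(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?};

# mark_naive is a hypothetical helper: it wraps the first match in LEFT/RIGHT.
sub mark_naive {
    my ($text) = @_;
    # The outer parentheses become $1, i.e. everything the pattern matched.
    $text =~ s/($rfc)/LEFT $1 RIGHT/;
    return $text;
}

# The "scheme" part [^:/?#]+ happily swallows "see http", and the "path"
# part [^?#]* swallows " for details", so the whole line is treated as one URI.
print mark_naive("see http://example.com/a for details"), "\n";
# prints: LEFT see http://example.com/a for details RIGHT
```

None of the character classes in the pattern exclude whitespace, which is why nothing stops the match at the edges of the URI.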

More directly applicable is the URI::Find module, available on CPAN. Wrapping each URI in LEFT and RIGHT is as simple as

#! /usr/bin/env perl

use strict;
use warnings;

use URI::Find;

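# The callback receives a URI object and the original matched text;
# its return value replaces that text in the input.
my $finder = URI::Find->new(sub {
    my(undef,$found) = @_;
    "LEFT $found RIGHT";
});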

while (<>) {
    $finder->find(\$_);
    print;
}

Output:

$ cat input
This is a plain text input suitable for
an answer to a question on http://stackoverflow.com

In particular, the question is available at
http://stackoverflow.com/q/15233535/123109 and the answer
at http://stackoverflow.com/a/15234378/123109

$ ./mark-uris input
This is a plain text input suitable for
an answer to a question on LEFT http://stackoverflow.com RIGHT

In particular, the question is available at
LEFT http://stackoverflow.com/q/15233535/123109 RIGHT and the answer
at LEFT http://stackoverflow.com/a/15234378/123109 RIGHT
Greg Bacon answered Jan 19 '23 00:01


I have found an answer to this question, thanks to another question, "Using regex to extract URLs from plain text with Perl". The regex is much simpler than the one I was trying before, but it appears to work in the simple cases I have tested.

perl -i -pe 's,(http.*?://([^\s)\"](?!ttp:))+),left $& right,g;' myfile
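For use inside a script rather than on the command line, the same substitution can be sketched as a small sub. The name mark_urls and the sample text are hypothetical; the pattern is the one from the command above, with $1 standing in for $& since the outer group already captures the whole match.

```perl
use strict;
use warnings;

# mark_urls is a hypothetical helper name used only for this sketch.
sub mark_urls {
    my ($text) = @_;
    # "http", lazily anything up to "://", then one or more characters that
    # are not whitespace, ")" or a double quote. The (?!ttp:) lookahead stops
    # the run just before a directly adjacent "http:" so that two URLs
    # written back to back are matched separately.
    $text =~ s{(http.*?://([^\s)"](?!ttp:))+)}{left $1 right}g;
    return $text;
}

print mark_urls('Visit http://example.com/page (or https://example.org) today.'), "\n";
# prints: Visit left http://example.com/page right (or left https://example.org right) today.
```

One caveat: the lazy .*? still matches across ordinary text, so a line such as "httpd docs at ftp://x" would be matched from "httpd" onward. That may be acceptable for a rough approximation, but it is worth checking against real input.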
starfry answered Jan 18 '23 23:01