I am trying to use a complex regex to match URLs in a body of text. The aim is to delimit the URLs in the text.
I would like to do something like the following:
perl -pe 's/regex/left $1 right/g;' inputfile
which will substitute all occurrences of the regex with the matched value surrounded by the words "left" and "right".
This is just a simplified example to illustrate the point - the real scenario has loads of -e
expressions and I am looking to add another for this particular matching purpose.
The regex is whatever matches a URL. I realise matching URLs is very difficult and it is probably impossible to identify all possibilities, but a reasonable approximation would be fine. I have found one such approximation at http://daringfireball.net/2010/07/improved_regex_for_matching_urls.
I cannot, however, get that regex to work in a Perl construct like the above. I have tried with delimiters other than /, for example ~, but without success.
Appendix B of RFC 2396 gives a regex for parsing URIs.
B. Parsing a URI Reference with a Regular Expression
As described in Section 4.3, the generic URI syntax is not sufficient to disambiguate the components of some forms of URI. Since the “greedy algorithm” described in that section is identical to the disambiguation method used by POSIX regular expressions, it is natural and commonplace to use a regular expression for parsing the potential four components and fragment identifier of a URI reference.
The following line is the regular expression for breaking-down a URI reference into its components.
^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
 12            3  4          5       6  7        8 9
The numbers in the second line above are only to assist readability; they indicate the reference points for each subexpression (i.e., each paired parenthesis). We refer to the value matched for subexpression <n> as $<n>. For example, matching the above expression to

http://www.ics.uci.edu/pub/ietf/uri/#Related
results in the following subexpression matches:
$1 = http:
$2 = http
$3 = //www.ics.uci.edu
$4 = www.ics.uci.edu
$5 = /pub/ietf/uri/
$6 = <undefined>
$7 = <undefined>
$8 = #Related
$9 = Related
where <undefined> indicates that the component is not present, as is the case for the query component in the above example. Therefore, we can determine the value of the four components and fragment as

scheme    = $2
authority = $4
path      = $5
query     = $7
fragment  = $9
and, going in the opposite direction, we can recreate a URI reference from its components using the algorithm in step 7 of Section 5.2.
The regex is directly usable in Perl, as in
if ($uri =~ m!^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?!) {
    my($host, $path) = ($4, $5);
    print "$host => $path\n";
}
Greed in regex quantifiers will likely make this pattern challenging to use with s///
because it will consume as much text as possible, likely overrunning unmarked URI boundaries.
More directly applicable is the URI::Find module, available on CPAN. Adding LEFT and RIGHT around each URI is as simple as
#! /usr/bin/env perl

use strict;
use warnings;

use URI::Find;

my $finder = URI::Find->new(sub {
    my(undef, $found) = @_;
    "LEFT $found RIGHT";
});

while (<>) {
    $finder->find(\$_);
    print;
}
Output:
$ cat input
This is a plain text input suitable for an answer
to a question on http://stackoverflow.com
In particular, the question is available at
http://stackoverflow.com/q/15233535/123109
and the answer at
http://stackoverflow.com/a/15234378/123109
$ ./mark-uris input
This is a plain text input suitable for an answer
to a question on LEFT http://stackoverflow.com RIGHT
In particular, the question is available at
LEFT http://stackoverflow.com/q/15233535/123109 RIGHT
and the answer at
LEFT http://stackoverflow.com/a/15234378/123109 RIGHT
I have found an answer to this question, thanks to another question, Using regex to extract URLs from plain text with Perl. The regex is much simpler than the one I was trying before, but it appears to work in the simple cases I have tested.
perl -i -pe 's,(http.*?://([^\s)\"](?!ttp:))+),left $& right,g;' myfile