So my Perl script basically takes a string and then tries to clean it up by doing multiple search and replaces on it, like so:
$text =~ s/<[^>]+>/ /g;
$text =~ s/\s+/ /g;
$text =~ s/[\(\{\[]\d+[\(\{\[]/ /g;
$text =~ s/\s+[<>]+\s+/\. /g;
$text =~ s/\s+/ /g;
$text =~ s/\.*\s*[\*|\#]+\s*([A-Z\"])/\. $1/g; # replace . **** Begin or . #### Begin or ) *The
$text =~ s/\.\s*\([^\)]*\) ([A-Z])/\. $1/g; # . (blah blah) S... => . S...
As you can see, I'm dealing with nasty html and have to beat it into submission.
I'm hoping there is a simpler, aesthetically appealing way to do this. I have about 50 lines that look just like what is above.
I have solved one version of this problem by using a hash where the key is the comment and the value is the regular expression, like so:
%rxcheck = (
'time of day'=>'\d+:\d+',
'starts with capital letters then a capital word'=>'^([A-Z]+\s)+[A-Z][a-z]',
'ends with a single capital letter'=>'\b[A-Z]\.'
);
And this is how I use it:
foreach my $key (keys %rxcheck) {
if($snippet =~ /$rxcheck{ $key }/g){ blah blah }
}
The problem comes up when I try my hand at a hash where the key is the expression and the value is what I want to replace it with... and there is a $1 or $2 in it.
%rxcheck2 = (
'(\w) \"'=>'$1\"'
);
The above is to do this:
$snippet =~ s/(\w) \"/$1\"/g;
But I can't seem to pass the "$1" part into the regex literally (I think that's the right word... it seems the $1 is being interpreted even though I used ' marks.) So this results in:
if($snippet =~ /$key/$rxcheck2{ $key }/g){ }
And that doesn't work.
So 2 questions:
Easy: How do I handle large numbers of regexes in an easily editable way, so I can change and add them without just copying and pasting the line before?
Harder: How do I handle them using a hash (or array if I have, say, multiple pieces I want to include, like 1) part to search, 2) replacement 3) comment, 4) global/case insensitive modifiers), if that is in fact the easiest way to do this?
Thanks for your help -
As there doesn't appear to be much structure shared by the individual regexes, there's not really a simpler or clearer way than just listing the commands as you have done. One common approach to decreasing repetition in code like this is to move $text into $_, so that instead of having to say:
$text =~ s/foo/bar/g;
You can just say:
s/foo/bar/g;
A common idiom for doing this is to use a degenerate for() loop as a topicalizer:
for($text)
{
s/foo/bar/g;
s/qux/meh/g;
...
}
The scope of this block will preserve any preexisting value of $_, so there's no need to explicitly localize $_.
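As a minimal runnable sketch of the topicalizer idiom above (the helper name clean_text and the sample input are made up for illustration, and only core Perl is assumed):

```perl
use strict;
use warnings;

# Hypothetical helper: applies a few substitutions to a copy of its argument.
sub clean_text {
    my ($text) = @_;
    for ($text) {           # aliases $_ to $text inside the block
        s/<[^>]+>/ /g;      # replace each tag with a space
        s/\s+/ /g;          # collapse runs of whitespace
        s/^ | $//g;         # drop a single leading or trailing space
    }
    return $text;
}

$_ = 'outer value';         # pre-existing $_ that should survive the call
my $cleaned = clean_text('<p>Hello   <b>world</b></p>');
print "$cleaned\n";
print "$_\n";               # still the value assigned before the call
```

Because for() aliases $_ to $text, each s/// edits $text directly, and the caller's $_ is restored when the loop exits.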
At this point, you've eliminated almost every non-boilerplate character -- how much shorter can it get, even in theory?
Unless what you really want (as your problem #2 suggests) is improved modularity, e.g., the ability to iterate over, report on, count etc. all regexes.
You can use the qr// syntax to quote the "search" part of the substitution:
my $search = qr/(<[^>]+>)/;
$str =~ s/$search/foo,$1,bar/;
However I don't know of a way of quoting the "replacement" part adequately. I had hoped that qr// would work for this too, but it doesn't. There are two alternatives worth considering:
1. Use eval() in your foreach loop. This would enable you to keep your current %rxcheck2 hash. Downside: you should always be concerned about safety with string eval()s.
2. Use an array of anonymous subroutines:
my @replacements = (
sub { $_[0] =~ s/<[^>]+>/ /g; },
sub { $_[0] =~ s/\s+/ /g; },
sub { $_[0] =~ s/[\(\{\[]\d+[\(\{\[]/ /g; },
sub { $_[0] =~ s/\s+[<>]+\s+/\. /g },
sub { $_[0] =~ s/\s+/ /g; },
sub { $_[0] =~ s/\.*\s*[\*|\#]+\s*([A-Z\"])/\. $1/g; },
sub { $_[0] =~ s/\.\s*\([^\)]*\) ([A-Z])/\. $1/g; }
);
# Assume your data is in $_
foreach my $repl (@replacements) {
$repl->($_);
}
You could of course use a hash instead, with some more useful description as the key, and/or you could use multivalued elements (or hash values) that include comments or other information.
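One way to sketch the multivalued idea is an array of rules, each holding a qr// pattern (which carries its own modifiers such as /i), a replacement callback, and a comment. This is only an illustration under assumptions: the names @rules and apply_rules and the sample text are invented here, and the callback is handed just the first two capture groups. Since the replacement is ordinary code run through the /e modifier, no string eval() is needed to get at $1.

```perl
use strict;
use warnings;

# Each rule: [ compiled pattern, replacement callback, comment ].
# The callback receives the capture groups, so "$1" becomes $_[0].
my @rules = (
    [ qr/<[^>]+>/,    sub { ' ' },         'replace tags with a space' ],
    [ qr/\s+/,        sub { ' ' },         'collapse whitespace' ],
    [ qr/(\w) "/,     sub { $_[0] . '"' }, 'remove the space before a quote' ],
    [ qr/\bperl\b/i,  sub { 'Perl' },      'normalize case (pattern carries /i)' ],
    [ qr/^\s+|\s+$/,  sub { '' },          'trim both ends' ],
);

# Hypothetical driver: applies every rule, in order, to a copy of the text.
sub apply_rules {
    my ($text, $rules) = @_;
    for my $rule (@$rules) {
        my ($re, $replace, $comment) = @$rule;
        # /e evaluates the replacement as code; only $1 and $2 are
        # passed along in this sketch.
        $text =~ s/$re/$replace->($1, $2)/ge;
    }
    return $text;
}

my $input = '<p>She said  hello " and left PERL</p>';
print apply_rules($input, \@rules), "\n";
```

Adding or reordering a rule is then a one-line change to @rules, and the comment field stays available for logging or reporting which rule is being applied.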
You say you are dealing with HTML. You are now realizing that this is pretty much a losing battle with fleeting and fragile solutions.
A proper HTML parser would make your life easier. HTML::Parser can be hard to use, but there are other very useful libraries on CPAN which I can recommend if you can specify what you are trying to do rather than how.