
How can I efficiently handle multiple Perl search/replace operations on the same string?

Tags:

regex

hash

perl

substitution

So my Perl script basically takes a string and then tries to clean it up by doing multiple search and replaces on it, like so:

$text =~ s/<[^>]+>/ /g;
$text =~ s/\s+/ /g;
$text =~ s/[\(\{\[]\d+[\(\{\[]/ /g;
$text =~ s/\s+[<>]+\s+/\. /g;
$text =~ s/\s+/ /g;
$text =~ s/\.*\s*[\*|\#]+\s*([A-Z\"])/\. $1/g; # replace . **** Begin or . #### Begin or ) *The 
$text =~ s/\.\s*\([^\)]*\) ([A-Z])/\. $1/g; # . (blah blah) S... => . S...

As you can see, I'm dealing with nasty html and have to beat it into submission.

I'm hoping there is a simpler, aesthetically appealing way to do this. I have about 50 lines that look just like what is above.

I have solved one version of this problem by using a hash where the key is the comment and the value is the regular expression, like so:

%rxcheck = (
    'time of day'=>'\d+:\d+',
    'starts with capital letters then a capital word'=>'^([A-Z]+\s)+[A-Z][a-z]',
    'ends with a single capital letter'=>'\b[A-Z]\.'
);

And this is how I use it:

foreach my $key (keys %rxcheck) {
    # no /g here: in a boolean test it leaves pos($snippet) set for the next pattern
    if ($snippet =~ /$rxcheck{$key}/) { blah blah }
}

The problem comes up when I try my hand at a hash where the key is the expression and the value is what I want to replace it with... and there is a $1 or $2 in it.

%rxcheck2 = (
    '(\w) \"'=>'$1\"'
);

The above is to do this:

$snippet =~ s/(\w) \"/$1\"/g;

But I can't seem to pass the "$1" part into the regex literally (I think that's the right word... it seems the $1 is being interpreted even though I used ' marks.) So this results in:

if($snippet =~ /$key/$rxcheck2{ $key }/g){  }

And that doesn't work.

So 2 questions:

Easy: How do I handle large numbers of regexes in an easily editable way, so I can change and add them without just copying and pasting the line before?

Harder: How do I handle them using a hash (or array if I have, say, multiple pieces I want to include, like 1) part to search, 2) replacement 3) comment, 4) global/case insensitive modifiers), if that is in fact the easiest way to do this?

Thanks for your help -


asked May 09 '09 16:05

Jeff

2 Answers

Problem #1

As there doesn't appear to be much structure shared by the individual regexes, there's not really a simpler or clearer way than just listing the commands as you have done. One common approach to decreasing repetition in code like this is to move $text into $_, so that instead of having to say:

$text =~ s/foo/bar/g;

You can just say:

s/foo/bar/g;

A common idiom for doing this is to use a degenerate for() loop as a topicalizer:

for($text)
{
  s/foo/bar/g;
  s/qux/meh/g;
  ...
}

The scope of this block will preserve any preexisting value of $_, so there's no need to explicitly localize $_.
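As a minimal sketch of the topicalizer idiom, the snippet below runs two of the question's substitutions inside a `for` block and checks that an outer `$_` survives. The sample input and the final trimming substitution are assumptions added for the demonstration, not part of the original script.

```perl
use strict;
use warnings;

# Hypothetical sample input, not taken from the question.
my $text = "<p>Hello   <b>world</b></p>";

# A preexisting value in $_ that the loop should leave untouched.
$_ = "outer value";

for ($text) {
    s/<[^>]+>/ /g;      # replace each tag with a space
    s/\s+/ /g;          # collapse runs of whitespace
    s/^\s+|\s+$//g;     # trim both ends (added for this demo)
}

print "$text\n";        # the cleaned string
print "$_\n";           # still the outer value
```

Because `for` aliases `$_` to `$text`, the substitutions modify `$text` in place, and the alias ends with the block.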

At this point, you've eliminated almost every non-boilerplate character -- how much shorter can it get, even in theory?

Unless what you really want (as your problem #2 suggests) is improved modularity, e.g., the ability to iterate over, report on, count etc. all regexes.

Problem #2

You can use the qr// syntax to quote the "search" part of the substitution:

my $search = qr/(<[^>]+>)/;
$str =~ s/$search/foo,$1,bar/;
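A small sketch of how `qr//` could be combined with the question's comment-keyed hash: each value is a compiled pattern, and modifiers such as `/i` travel with it. The third rule and the sample string are hypothetical, added only to show a stored modifier.

```perl
use strict;
use warnings;

# Comment => compiled pattern. Modifiers like /i are stored in the qr// object.
my %rxcheck = (
    'time of day'                       => qr/\d+:\d+/,
    'ends with a single capital letter' => qr/\b[A-Z]\./,
    'mentions perl, any case'           => qr/perl/i,    # hypothetical extra rule
);

# Hypothetical sample input.
my $snippet = 'Meet at 10:30 to discuss PERL with Mr. B.';

# Collect the comments of every rule that matches.
my @matched = sort grep { $snippet =~ $rxcheck{$_} } keys %rxcheck;

print "$_\n" for @matched;
```

This covers only the search side; the `/g` flag is not part of a `qr//` object and still has to be written on the `s///` or `m//` that uses it.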

However I don't know of a way of quoting the "replacement" part adequately. I had hoped that qr// would work for this too, but it doesn't. There are two alternatives worth considering:

1. Use eval() in your foreach loop. This would enable you to keep your current %rxcheck2 hash. Downside: you should always be concerned about safety with string eval()s.

2. Use an array of anonymous subroutines:

my @replacements = (
    sub { $_[0] =~ s/<[^>]+>/ /g; },
    sub { $_[0] =~ s/\s+/ /g; },
    sub { $_[0] =~ s/[\(\{\[]\d+[\(\{\[]/ /g; },
    sub { $_[0] =~ s/\s+[<>]+\s+/\. /g },
    sub { $_[0] =~ s/\s+/ /g; },
    sub { $_[0] =~ s/\.*\s*[\*|\#]+\s*([A-Z\"])/\. $1/g; },
    sub { $_[0] =~ s/\.\s*\([^\)]*\) ([A-Z])/\. $1/g; }
);

# Assume your data is in $_
foreach my $repl (@replacements) {
    $repl->($_);
}
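For comparison, the first alternative (string evaluation) can be sketched with the `/ee` modifier. This assumes the replacement is stored as a Perl expression in a string (here `qq{$1"}` rather than the question's `$1\"`, which is not valid code on its own), and it should only be used with a table you fully trust, since the stored strings are executed. The sample input is hypothetical.

```perl
use strict;
use warnings;

# Pattern => replacement, where the replacement is a Perl expression kept as a string.
my %rxcheck2 = (
    '(\w) "' => 'qq{$1"}',
);

# Hypothetical sample input.
my $snippet = 'He said "hello " and left "';

for my $pattern (keys %rxcheck2) {
    my $replacement = $rxcheck2{$pattern};
    # The first /e yields the stored string; the second evaluates it as code,
    # at which point $1 holds the capture from the current match.
    $snippet =~ s/$pattern/$replacement/gee;
}

print "$snippet\n";
```

Note that the pattern removes the space before every double quote that follows a word character, exactly as the original `s/(\w) \"/$1\"/g` would.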

Instead of a plain array of subroutines, you could of course use a hash keyed by something more descriptive, such as the comment, and/or store multivalued elements (array or hash references) that carry the comment, modifiers, or other information alongside each substitution.
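One possible shape for such multivalued elements is an ordered list of rule records, sketched below. The field names (`comment`, `search`, `replace`), the `apply_rules` helper, and the sample input are all hypothetical. The replacement is a callback that receives the first two captures as arguments, which sidesteps the problem of quoting `$1` in a string.

```perl
use strict;
use warnings;

# Each rule bundles a comment, a compiled pattern, and a replacement callback.
my @rules = (
    { comment => 'strip tags',
      search  => qr/<[^>]+>/,
      replace => sub { ' ' } },
    { comment => 'collapse whitespace',
      search  => qr/\s+/,
      replace => sub { ' ' } },
    { comment => 'remove the space before a double quote',
      search  => qr/(\w) "/,
      replace => sub { my ($word_char) = @_; return qq{$word_char"} } },
);

# Apply every rule in order; return the cleaned text and a per-rule hit count.
sub apply_rules {
    my ($text, @todo) = @_;
    my %hits;
    for my $rule (@todo) {
        my ($search, $replace) = @{$rule}{qw(search replace)};
        # /e runs the callback once per match, passing the captures explicitly.
        my $count = ($text =~ s/$search/$replace->($1, $2)/ge);
        $hits{ $rule->{comment} } = $count || 0;
    }
    return ($text, \%hits);
}

# Hypothetical sample input.
my ($clean, $hits) = apply_rules(qq{<b>A quote "</b>  here}, @rules);

print "$clean\n";
print "$_: $hits->{$_}\n" for sort keys %$hits;
```

An array is used rather than a hash because the order of substitutions matters here, and hash keys come back in no fixed order. Case-insensitivity can be stored in the `qr//` itself, while `/g` is applied uniformly by the helper in this sketch.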

answered Nov 15 '22 06:11

j_random_hacker

You say you are dealing with HTML. You are now realizing that this is pretty much a losing battle with fleeting and fragile solutions.

A proper HTML parser would make your life easier. HTML::Parser can be hard to use, but there are other very useful libraries on CPAN which I can recommend if you can specify what you are trying to do rather than how.

answered Nov 15 '22 04:11

Sinan Ünür
