Can Perl substitution operator match an element in an array?

Tags:

perl

I have an array like this

my @stopWords = ("and","this",....)

My text is in this variable

my $wholeText = "....and so this is...."

I want to match every occurrence of every element of my stopWords array in the scalar wholeText and replace it with spaces.

One way of doing this is as follows:

foreach my $stopW (@stopWords)
{
   $wholeText =~ s/$stopW/ /g;
}

This works and replaces every occurrence of all the stop words. I was just wondering whether there is a shorter way of doing it.

Like this:

$wholeText =~ s/@stopWords/ /;

The above does not seem to work though.


asked Oct 27 '10 06:10

Radz

5 Answers

While the various map/for-based solutions will work, they'll also do regex processing of your string separately for each and every stopword. While this is no big deal in the example given, it can cause major performance issues as the target text and stopword list grow.

Jonathan Leffler and Robert P are on the right track with their suggestions of mashing all the stopwords together into a single regex, but a simple join of all the stopwords into a single alternation is a crude approach and, again, becomes inefficient if the stopword list is long.

Enter Regexp::Assemble, which will build you a much 'smarter' regex to handle all the matches at once - I've used it to good effect with lists of up to 1700 or so words to be checked against:

#!/usr/bin/env perl

use strict;
use warnings;
use 5.010;

use Regexp::Assemble;

my @stopwords = qw( and the this that a an in to );

my $whole_text = <<EOT;
Fourscore and seven years ago our fathers brought forth
on this continent a new nation, conceived in liberty, and
dedicated to the proposition that all men are created equal.
EOT

my $ra = Regexp::Assemble->new(anchor_word_begin => 1, anchor_word_end => 1);
$ra->add(@stopwords);
say $ra->as_string;

say '---';

my $re = $ra->re;
$whole_text =~ s/$re//g;
say $whole_text;

Which outputs:

\b(?:t(?:h(?:at|is|e)|o)|a(?:nd?)?|in)\b
---
Fourscore  seven years ago our fathers brought forth
on  continent  new nation, conceived  liberty, 
dedicated   proposition  all men are created equal.
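If you want to check the performance claim on your own data rather than take it on faith, the core Benchmark module can compare the two styles. This is only a rough sketch: the repeated sample sentence, the iteration count and the sub names `strip_per_word` and `strip_single_pass` are made up for illustration, and it uses a plain joined alternation rather than Regexp::Assemble so that it runs without any CPAN modules.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my @stopwords = qw( and the this that a an in to );

# Hypothetical input: one sentence repeated so the timings are measurable.
my $text = "Fourscore and seven years ago our fathers brought forth on this continent a new nation. " x 500;

# One alternation, compiled once, anchored on word boundaries.
my $alternation = join '|', map { quotemeta } @stopwords;
my $single_re   = qr/\b(?:$alternation)\b/;

# Approach 1: one substitution pass over the text per stopword.
sub strip_per_word {
    my ($copy) = @_;
    $copy =~ s/\b\Q$_\E\b//g for @stopwords;
    return $copy;
}

# Approach 2: a single pass using the combined regex.
sub strip_single_pass {
    my ($copy) = @_;
    $copy =~ s/$single_re//g;
    return $copy;
}

# Run each approach 200 times and print a comparison table.
cmpthese(200, {
    per_word    => sub { strip_per_word($text) },
    single_pass => sub { strip_single_pass($text) },
});
```

The numbers will depend on your Perl version, the text and the length of the stopword list, so treat the table as a measurement of your setup and not as a general result.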

answered Oct 29 '22 00:10

Dave Sherohman

My best solution:

$wholeText =~ s/$_//g for @stopWords;

You might want to sharpen the regexp using some \b and whitespace.
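To see why the `\b` matters, here is a small sketch comparing the one-liner with and without word boundaries. The sub names and the sample sentence are invented for the illustration.

```perl
use strict;
use warnings;

my @stopWords = qw( and this );

# Remove stopwords wherever they appear, even inside longer words.
sub strip_anywhere {
    my ($text) = @_;
    $text =~ s/$_//g for @stopWords;
    return $text;
}

# Remove stopwords only when they stand alone as whole words.
sub strip_whole_words {
    my ($text) = @_;
    $text =~ s/\b$_\b//g for @stopWords;
    return $text;
}

my $sample = "a thousand and one";
print strip_anywhere($sample), "\n";      # "thousand" loses its "and" too
print strip_whole_words($sample), "\n";   # only the standalone "and" goes
```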

answered Oct 29 '22 01:10

zoul

What about:

my $qrstring = '\b(' . (join '|', @stopWords) . ')\b';
my $qr = qr/$qrstring/;
$wholeText =~ s/$qr/ /g;

Concatenate all the words to form '\b(and|the|it|...)\b'. The parentheses around the join are necessary: without them, the '.' operator binds to @stopWords before join sees it, so the array is evaluated in scalar context and you end up with the number of words instead of the words themselves. The '\b' metacharacters mark word boundaries, and therefore prevent you from changing 'thousand' into 'thous'. Convert that string into a quoted regular expression and apply it globally to your subject string, so that all occurrences of all stop words are removed in a single operation.

You can also do without the variable '$qr':

my $qrstring = '\b(' . (join '|', @stopWords) . ')\b';
$wholeText =~ s/$qrstring/ /g;

I don't think I'd care to maintain the code of anyone who managed to do without the variable '$qrstring'; it probably can be done, but I don't think it would be very readable.
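The point about the parentheses around the join is easy to check for yourself. The sketch below builds the pattern string both ways; the two sub names are hypothetical and exist only to put the variants side by side.

```perl
use strict;
use warnings;

# Parenthesised join: the list is joined first, then the suffix is appended.
sub pattern_with_parens {
    my @stopWords = @_;
    return '\b(' . (join '|', @stopWords) . ')\b';
}

# No parentheses: "." binds tighter than the comma, so @stopWords . ')\b'
# is evaluated first, putting the array in scalar context (its length).
sub pattern_without_parens {
    my @stopWords = @_;
    return '\b(' . join '|', @stopWords . ')\b';
}

my @words = qw( and the it );
print pattern_with_parens(@words), "\n";      # \b(and|the|it)\b
print pattern_without_parens(@words), "\n";   # \b(3)\b
```

The second pattern still compiles as a regex, so the mistake does not raise an error. It just matches a literal "3" instead of your stop words.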

answered Oct 28 '22 23:10

Jonathan Leffler

My paranoid version:

$wholeText =~ s/\b\Q$_\E\b/ /gi for @stopWords;

Use \b to match word boundaries, and \Q..\E just in case any of your stopwords contains characters which may be interpreted as "special" by the regex engine.
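As an illustration of what `\Q..\E` protects against, here is a sketch with a made-up stopword that contains a dot. The stopword `a.m`, the sample text and the sub names are assumptions chosen for the demonstration.

```perl
use strict;
use warnings;

# Hypothetical stopword list containing a regex metacharacter (the dot).
my @stopWords = ('a.m');

# Unquoted: "." matches any character, so "aim" is treated as a stopword.
sub strip_unquoted {
    my ($text) = @_;
    $text =~ s/\b$_\b/ /gi for @stopWords;
    return $text;
}

# Quoted: \Q..\E escapes the dot, so only the literal "a.m" matches.
sub strip_quoted {
    my ($text) = @_;
    $text =~ s/\b\Q$_\E\b/ /gi for @stopWords;
    return $text;
}

my $sample = "aim for 9 a.m sharp";
print "[", strip_unquoted($sample), "]\n";   # "aim" is removed as well
print "[", strip_quoted($sample),   "]\n";   # only "a.m" is removed
```

One caveat: `\b` only matches between a word character and a non-word character, so a stopword that begins or ends with punctuation (say `c++`) would need different anchoring.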

answered Oct 29 '22 01:10

mfontani

You could consider using a regex join to create a single regex.

my $regex_str = join '|', map { quotemeta } @stopwords;
$string =~ s/$regex_str/ /g;

Note that the quotemeta part just makes sure that any regex characters are properly escaped.
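If you also want whole-word matching with the joined regex, one possible way is to wrap the alternation in a non-capturing group so that `\b` applies to every alternative and not just the first and last. A minimal sketch, where `build_stopword_regex` and `strip_stopwords` are hypothetical helper names:

```perl
use strict;
use warnings;

# Escape each word, join into one alternation, and wrap it in a
# non-capturing group so that \b applies to every alternative.
sub build_stopword_regex {
    my @words = @_;
    my $alternation = join '|', map { quotemeta } @words;
    return qr/\b(?:$alternation)\b/;
}

# Replace every whole-word stopword in $text with a single space.
sub strip_stopwords {
    my ($text, @words) = @_;
    my $re = build_stopword_regex(@words);
    $text =~ s/$re/ /g;
    return $text;
}

my @stopwords = qw( and the this );
print "[", strip_stopwords("the thousand and this", @stopwords), "]\n";
```

Compiling the pattern once with `qr//` keeps the substitution line short and lets you reuse the regex across several strings.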

answered Oct 28 '22 23:10

Robert P
