How can I match strings that don't match a particular pattern in Perl?

Tags: regex, perl

I know that it is easy to match anything except a given character using a regular expression.

$text = "ab ac ad";
$text =~ s/[^c]*//g; # Match anything, except c.

$text is now "c".

I don't know how to "except" strings instead of characters. How would I "match anything, except 'ac'" ? Tried [^(ac)] and [^"ac"] without success.

Is it possible at all?

ssn asked Jan 21 '10 11:01


4 Answers

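The following solves the question as understood in the second sense described in Bart K.'s comment:

$text = 'ab ac ad';
$text =~ s/(ac)|./$1/g;   # keep every "ac", delete any other character
# Under "use warnings", $1 is undefined whenever "." matched; on Perl 5.10+
# s{(ac)|.}{$1 // ''}ge avoids that warning.
print $text;              # prints "ac"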

Also, 'abacadac' -> 'acac'

Note, though, that in most practical applications negative lookaheads prove more useful than this approach.
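As a rough sketch of the negative-lookahead alternative (the sample strings and variable names below are made up for illustration), the idiom (?:(?!ac).)* consumes one character at a time, but only while "ac" does not start at the current position:

```perl
use strict;
use warnings;

# Hypothetical sample data, not taken from the question.
my @lines = ("ab ad", "ab ac ad", "cab");

# ^(?:(?!ac).)*$ succeeds only if "ac" starts at no position in the string.
my @without_ac = grep { /^(?:(?!ac).)*$/ } @lines;

# Capture everything before the first "ac" (the whole string if there is none).
my ($prefix) = "ab ac ad" =~ /^((?:(?!ac).)*)/;

print join(", ", @without_ac), "\n";
print "[$prefix]\n";
```

For a plain yes/no test, a negated match such as $text !~ /ac/ is simpler. The lookahead form is mainly useful when the "not ac" part has to sit inside a larger pattern.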

Antony Hatchkins answered Nov 12 '22 23:11


If you just want to check if the string does not contain "ac", just use a negation.

$text = "ab ac ad";

print "ac not found" if $text !~ /ac/;

or

print "ac not found" unless $text =~ /ac/;

Christoffer Hammarström answered Nov 12 '22 23:11


$text =~ s/[^c]*//g; // Match anything, except c.

@ssn, A couple of comments about your question:

  1. "//" is not a comment in Perl. Only "#" is.
  2. "[^c]*" - there is no need for the "*" there. "[^c]" means the character class composed of all characters except the letter "c". Then you use the /g modifier, meaning all such occurrences in the text will be replaced (in your example, with nothing). The "zero or more" ("*") modifier is therefore redundant.
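A small sketch to check point 2, assuming the same sample string as in the question: with /g, both forms leave the same text behind. They differ only in how many substitutions are performed.

```perl
use strict;
use warnings;

my $with_star    = "ab ac ad";
my $without_star = "ab ac ad";

# s///g returns the number of substitutions it made.
my $n_star  = ($with_star    =~ s/[^c]*//g);   # removes runs of non-"c" characters
my $n_plain = ($without_star =~ s/[^c]//g);    # removes them one at a time

print "with *:    '$with_star' after $n_star substitutions\n";
print "without *: '$without_star' after $n_plain substitutions\n";
```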

How would I "match anything, except 'ac'" ? Tried [^(ac)] and [^"ac"] without success.

Please read the documentation on character classes (see "perldoc perlre" on your command line, or online at http://perldoc.perl.org/perlre.html). You'll see it states that, for the list of characters within the square brackets, the RE will "match any character from the list". This means that order is not relevant and there are no "strings", only a list of characters. "()" and double quotes also have no special meaning inside the square brackets.
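To illustrate, here is a minimal sketch with a made-up input string. Inside the brackets, "(", ")", "a" and "c" are four separate class members, so reordering them changes nothing:

```perl
use strict;
use warnings;

# Hypothetical input chosen so that it contains parentheses as well.
my $text = "ab ac ad (x)";

# [^(ac)] means "any single character that is not (, ), a or c".
(my $kept = $text) =~ s/[^(ac)]//g;

# Same set of characters listed in a different order: same class.
(my $kept_reordered = $text) =~ s/[^)c(a]//g;

print "$kept\n";
print "$kept_reordered\n";
```

Both substitutions delete every character outside the set, which is why [^(ac)] cannot express "anything except the string ac".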

Now I'm not exactly sure why you're talking about matching but then giving an example of substitution. But to see if a string does not match the sub-string "ac" you just need to negate the match:

use strict; use warnings;
my $text = "ab ac ad";
if ($text !~ m/ac/) {
   print "Yey the text doesn't match 'ac'!\n"; # this shouldn't be printed
}

Say you have a string of text within which are embedded multiple occurrences of a substring. If you just want the text surrounding the sub-string, just remove all occurrences of the sub-string:

$text =~ s/ac//g;

If you want the reverse - to remove all text except for all occurrences of the sub-string, I would suggest something like:

use strict; use warnings;
my $text = "ab ac ad ac ae";
my $sub_str = "ac";
my @captured = $text =~ m/(\Q$sub_str\E)/g; # \Q...\E: match $sub_str as literal text
my $num = scalar @captured;
print (($sub_str x $num) . "\n");

This basically counts the number of times the sub-string appears in the text and prints the sub-string that number of times using the "x" operator. It's not very elegant; I'm sure a Perl guru could come up with something better.
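One possible shorter variant, offered as a sketch rather than as the definitive idiom: in list context, a /g match without capture groups returns every matched substring, so the matches can be joined directly instead of being counted first.

```perl
use strict;
use warnings;

my $text    = "ab ac ad ac ae";
my $sub_str = "ac";

# \Q...\E quotes $sub_str so that any regex metacharacters in it match literally.
my $only_sub = join '', $text =~ /\Q$sub_str\E/g;

print "$only_sub\n";
```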


@ennuikiller:

my $text = "ab ac ad";
$text !~ s/(ac)//g; # Match anything, except ac.

This is incorrect, since it generates a warning ("Useless use of negative pattern binding (!~) in void context") under "use warnings" and doesn't do anything except remove all substrings "ac" from the text, which could be more simply written as I wrote above with:

$text =~ s/ac//g;

Offer Kaye answered Nov 12 '22 23:11


Update: In a comment on your question, you mentioned you want to clean wiki markup and remove balanced sequences of {{ ... }}. Section 6 of the Perl FAQ (perlfaq6) covers this under "Can I use Perl regular expressions to match balanced text?"

Consider the following program:

#! /usr/bin/perl

use warnings;
use strict;

use Text::Balanced qw/ extract_tagged /;

# for demo only
*ARGV = *DATA;

while (<>) {
  if (s/^(.+?)(?=\{\{)//) {
    print $1;
    my(undef,$after) = extract_tagged $_, "{{" => "}}";

    if (defined $after) {
      $_ = $after;
      redo;
    }
  }

  print;
}

__DATA__
Lorem ipsum dolor sit amet, consectetur
adipiscing elit. {{delete me}} Sed quis
nulla ut dolor {{me too}} fringilla
mollis {{ quis {{ ac }} erat.

Its output:

Lorem ipsum dolor sit amet, consectetur
adipiscing elit.  Sed quis
nulla ut dolor  fringilla
mollis {{ quis  erat.

For your particular example, you could use

$text =~ s/[^ac]|a(?!c)|(?<!a)c//g;

That is, delete every character other than a and c, and delete an a or a c only when it isn't part of an ac sequence.
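A quick way to try that substitution on a few inputs. The helper name keep_only_ac and the extra sample strings are hypothetical and exist only for this sketch:

```perl
use strict;
use warnings;

# Hypothetical wrapper around the substitution shown above.
sub keep_only_ac {
    my ($text) = @_;
    # Delete: any char that is not a/c, an "a" not followed by "c",
    # and a "c" not preceded by "a".
    $text =~ s/[^ac]|a(?!c)|(?<!a)c//g;
    return $text;
}

for my $sample ("ab ac ad", "abacadac", "aac", "cca") {
    printf "%-10s -> '%s'\n", $sample, keep_only_ac($sample);
}
```

The approach relies on the excluded string being short. For longer strings, the number of alternatives grows quickly.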

In general, this is tricky to do with a regular expression.

Say you don't want foo followed by optional whitespace and then bar in $str. Often, it's clearer and easier to check separately. For example:

die "invalid string ($str)"
  if $str =~ /^.*foo\s*bar/;

You might also be interested in an answer to a similar question, where I wrote

my $nofoo = qr/
  (      [^f] |
    f  (?! o) |
    fo (?! o  \s* bar)
  )*
/x;

my $pattern = qr/^ $nofoo bar /x;
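The sketch below exercises that pattern on a few made-up strings. As far as can be told from the pattern itself, it accepts a string when a bar can be reached from the start without first passing through foo, optional whitespace and bar.

```perl
use strict;
use warnings;

# Patterns copied from the snippet above.
my $nofoo = qr/
  (      [^f] |
    f  (?! o) |
    fo (?! o  \s* bar)
  )*
/x;

my $pattern = qr/^ $nofoo bar /x;

# Made-up inputs for illustration only.
for my $str ("bar", "food bar", "foo bar", "foobar") {
    printf "%-10s %s\n", $str, ($str =~ $pattern ? "matches" : "does not match");
}
```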

To understand the complication, read How Regexes Work by Mark Dominus. The engine compiles regular expressions into state machines. When it's time to match, it feeds the input string to the state machine and checks whether the state machine finishes in an accept state. So to exclude a string, you have to specify a machine that accepts all inputs except a particular sequence.

What might help is a hypothetical /v regular expression switch that creates the state machine as usual but then complements the accept-state bit for all states. It's hard to say whether this would really be useful compared with separate checks, because a /v regular expression may still surprise people, just in different ways.
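To make the complement idea concrete, here is a toy, hand-built state machine for "the input contains ac". It only illustrates the concept and is not how Perl's regex engine is implemented. The state numbering and the function names are assumptions of this sketch.

```perl
use strict;
use warnings;

# States: 0 = start, 1 = just saw "a", 2 = saw "ac" (absorbing).
my %accept = (2 => 1);

# Feed the input through the machine and return the final state.
sub run_dfa {
    my ($input) = @_;
    my $state = 0;
    for my $ch (split //, $input) {
        if    ($state == 2)               { next }          # stay in state 2
        elsif ($ch eq 'a')                { $state = 1 }
        elsif ($ch eq 'c' && $state == 1) { $state = 2 }
        else                              { $state = 0 }
    }
    return $state;
}

# Original machine: accept when the final state is an accept state.
sub contains_ac { return $accept{ run_dfa($_[0]) } ? 1 : 0 }

# Complemented machine: same transitions, accept bit flipped for every state.
sub lacks_ac    { return $accept{ run_dfa($_[0]) } ? 0 : 1 }

for my $s ("ab ad", "ab ac ad", "aac") {
    printf "%-10s contains_ac=%d lacks_ac=%d\n", $s, contains_ac($s), lacks_ac($s);
}
```

Flipping the accept bit works here because the machine is deterministic and has a transition for every input character.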

If you're interested in the theoretical details, see An Introduction to Formal Languages and Automata by Peter Linz.

Greg Bacon answered Nov 12 '22 23:11