
Negating bracketed character classes in Perl regular expressions and grep

Tags: arrays, regex, perl

I'm attempting to solve a very simple problem - find strings in an array which only contain certain letters. However, I've run up against something in the behavior of regular expressions and/or grep that I don't get.

#!/usr/bin/perl

use warnings;
use strict;

my @test_data = qw(ant bee cat dodo elephant frog giraffe horse);

# Words wanted include these letters only. Hardcoded for demonstration purposes
my @wanted_letters = qw/a c d i n o t/;

# Subtract those letters from the alphabet to find the letters to eliminate.
# Interpolate array into a negated bracketed character class, positive grep
# against a list of the lowercase alphabet: fine, gets befghjklmpqrsuvwxyz.
my @unwanted_letters = grep(/[^@wanted_letters]/, ('a' .. 'z'));

# The desired result can be simulated by hardcoding the unwanted letters into a
# bracketed character class then doing a negative grep: matches ant, cat, and dodo.
my @works = grep(!/[befghjklmpqrsuvwxyz]/, @test_data);

# Doing something similar but moving the negation into the bracketed character
# class fails: it matches every word except bee.
my @fails1 = grep(/[^befghjklmpqrsuvwxyz]/, @test_data);

# Doing the same thing that produced the array of unwanted letters also fails.
my @fails2 = grep(/[^@unwanted_letters]/, @test_data);

print join ' ', @works; print "\n";
print join ' ', @fails1; print "\n";
print join ' ', @fails2; print "\n";

Questions:

  • Why does @works get the correct result but not @fails1? The grep docs suggest the former, and the negation section of perlrecharclass suggests the latter, although it uses =~ in its example. Is this something specifically to do with using grep?
  • Why does @fails2 not work? Is it something to do with array vs list context? It otherwise looks the same as the subtraction step.
  • Besides that, is there a pure regex way to achieve this that avoids the subtraction step?
asked Nov 01 '21 18:11 by Scott Martin


2 Answers

Both failing cases are fixed by adding the anchors ^ and $ and the quantifier +.

These both work:

my @fails1 = grep(/^[^befghjklmpqrsuvwxyz]+$/, @test_data);
my @fails2 = grep(/^[^@unwanted_letters]+$/, @test_data);

Keep in mind that /[^befghjklmpqrsuvwxyz]/ or /[^@unwanted_letters]/ matches only ONE character, anywhere in the string. Adding + lets the class match one or more consecutive characters, as many as possible. Adding ^ and $ forces that run to span from the start to the end of the string.

/[@wanted_letters]/ returns a match if the string contains a single wanted character, even when it also contains unwanted ones: the logical equivalent of any. Compare /^[@wanted_letters]+$/, where every character must be in the set of @wanted_letters: the equivalent of all.
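
A minimal sketch of the any/all distinction, reusing the word list and letters from the question. The array names @any and @all are made up for illustration.

```perl
use strict;
use warnings;

my @test_data      = qw(ant bee cat dodo elephant frog giraffe horse);
my @wanted_letters = qw/a c d i n o t/;

# Unanchored: true if ANY single character of the word is in the class.
my @any = grep { /[@wanted_letters]/ } @test_data;

# Anchored with +: true only if ALL characters of the word are in the class.
my @all = grep { /^[@wanted_letters]+$/ } @test_data;

print "any: @any\n";    # any: ant cat dodo elephant frog giraffe horse
print "all: @all\n";    # all: ant cat dodo
```

Only bee is missing from the first list, because none of its letters is wanted.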

Demo1: only ONE character, so grep fails.

Demo2: the quantifier allows more than one character, but without anchors grep still fails.

Demo3: anchors and quantifier give the expected result.
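
The three cases can be reproduced locally. This is a sketch assuming the demos applied the negated class from the question to the same word list; the array names are hypothetical.

```perl
use strict;
use warnings;

my @test_data = qw(ant bee cat dodo elephant frog giraffe horse);

# One character: matches any word with at least one character outside the set.
my @one_char = grep { /[^befghjklmpqrsuvwxyz]/ } @test_data;

# Quantifier, no anchors: a run of such characters, still anywhere in the word.
my @unanchored = grep { /[^befghjklmpqrsuvwxyz]+/ } @test_data;

# Anchors and quantifier: the run must cover the whole word.
my @anchored = grep { /^[^befghjklmpqrsuvwxyz]+$/ } @test_data;

print "one char:   @one_char\n";      # ant cat dodo elephant frog giraffe horse
print "unanchored: @unanchored\n";    # ant cat dodo elephant frog giraffe horse
print "anchored:   @anchored\n";      # ant cat dodo
```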

Once you understand that a character class matches only ONE character, that anchors tie the match to the WHOLE string, and that a quantifier extends the match to reach those anchors, you can grep directly with just the wanted letters:

my @wanted = grep(/^[@wanted_letters]+$/, @test_data);
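
One detail to be aware of: an array interpolated into a pattern is joined with the list separator $", a space by default, so [@wanted_letters] becomes [a c d i n o t] and the class also accepts a space. That is harmless for this word list. If it matters, the class can be built explicitly, as in this sketch; $class and $only_wanted are hypothetical names, not part of the original code.

```perl
use strict;
use warnings;

my @test_data      = qw(ant bee cat dodo elephant frog giraffe horse);
my @wanted_letters = qw/a c d i n o t/;

# Join without a separator. quotemeta escapes anything that would be special
# inside the brackets; it is a no-op for plain letters.
my $class       = join '', map { quotemeta($_) } @wanted_letters;
my $only_wanted = qr/^[$class]+$/;

my @wanted = grep { $_ =~ $only_wanted } @test_data;
print "@wanted\n";    # ant cat dodo

# The interpolated-array form accepts a space; the joined form does not.
print "array form matches 'a t'\n"  if 'a t' =~ /^[@wanted_letters]+$/;
print "joined form rejects 'a t'\n" if 'a t' !~ $only_wanted;
```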
answered Nov 14 '22 02:11 by dawg


You're matching something outside the character set anywhere in the string. But the string can still have characters from the set elsewhere. For instance, if the test word is elephant, the negated character class matches the a character.

If you want to test the whole string, you need to quantify it and anchor to the ends.

grep(/^[^befghjklmpqrsuvwxyz]*$/, @test_data);

Translated into English, it's the difference between "word contains no characters in the set" and "word contains a character not in the set".
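
A small sketch of the two statements side by side, on three words from the question. The hash %result and its keys are made up for illustration.

```perl
use strict;
use warnings;

my %result;
for my $word (qw(ant bee elephant)) {
    # "word contains a character not in the set"
    my $has_outside = $word =~ /[^befghjklmpqrsuvwxyz]/ ? 1 : 0;

    # "word contains no characters in the set"
    my $none_inside = $word =~ /^[^befghjklmpqrsuvwxyz]*$/ ? 1 : 0;

    $result{$word} = "$has_outside$none_inside";
    printf "%-8s has_outside=%d none_inside=%d\n",
        $word, $has_outside, $none_inside;
}
```

ant passes both tests, bee passes neither, and elephant passes only the first, which is why it slips through the unanchored grep. Because of the * quantifier, the anchored pattern also accepts an empty string.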

answered Nov 14 '22 04:11 by Barmar