I'm attempting to solve a very simple problem: find strings in an array which only contain certain letters. However, I've run up against something in the behavior of regular expressions and/or grep that I don't get.
#!/usr/bin/perl
use warnings;
use strict;
my @test_data = qw(ant bee cat dodo elephant frog giraffe horse);
# Words wanted include these letters only. Hardcoded for demonstration purposes
my @wanted_letters = qw/a c d i n o t/;
# Subtract those letters from the alphabet to find the letters to eliminate.
# Interpolate array into a negated bracketed character class, positive grep
# against a list of the lowercase alphabet: fine, gets befghjklmpqrsuvwxyz.
my @unwanted_letters = grep(/[^@wanted_letters]/, ('a' .. 'z'));
# The desired result can be simulated by hardcoding the unwanted letters into a
# bracketed character class then doing a negative grep: matches ant, cat, and dodo.
my @works = grep(!/[befghjklmpqrsuvwxyz]/, @test_data);
# Doing something similar but moving the negation into the bracketed character
# class fails and matches everything.
my @fails1 = grep(/[^befghjklmpqrsuvwxyz]/, @test_data);
# Doing the same thing that produced the array of unwanted letters also fails.
my @fails2 = grep(/[^@unwanted_letters]/, @test_data);
print join ' ', @works; print "\n";
print join ' ', @fails1; print "\n";
print join ' ', @fails2; print "\n";
Questions:
1. Why does @works get the correct result but not @fails1? The grep docs suggest the former, and the negation section of perlrecharclass suggests the latter, although it uses =~ in its example. Is this something specifically to do with using grep?
2. Why does @fails2 not work? Is it something to do with array vs list context? It otherwise looks the same as the subtraction step.
Both failing greps are fixed by adding the anchors ^ and $ and the quantifier +. These both work:
my @fails1 = grep(/^[^befghjklmpqrsuvwxyz]+$/, @test_data);
my @fails2 = grep(/^[^@unwanted_letters]+$/, @test_data);
Keep in mind that /[^befghjklmpqrsuvwxyz]/ or /[^@unwanted_letters]/ matches only ONE character. Adding + means one or more, as many as possible. Adding ^ and $ means the match must cover every character from the start to the end of the string.
With /[@wanted_letters]/ you get a match if the string has a single wanted character (even if it also has unwanted characters): the logical equivalent of any. Compare that to /^[@wanted_letters]+$/, where every letter needs to be in the set of @wanted_letters: the equivalent of all.
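To make the any/all comparison concrete, here is a minimal sketch that spells out the two regexes as per-character checks. The helper names has_any_wanted and has_all_wanted and the %is_wanted lookup are made up for illustration, and it assumes a List::Util recent enough (1.33 or newer) to export any and all.

```perl
use warnings;
use strict;
use List::Util qw(any all);    # any/all need List::Util 1.33 or newer

my @wanted_letters = qw/a c d i n o t/;
my %is_wanted      = map { $_ => 1 } @wanted_letters;

# Hypothetical helper: like /[@wanted_letters]/, true if at least one
# character of the word is a wanted letter.
sub has_any_wanted {
    my ($word) = @_;
    return ( any { $is_wanted{$_} } split //, $word ) ? 1 : 0;
}

# Hypothetical helper: like /^[@wanted_letters]+$/, true only if every
# character of the word is a wanted letter.
sub has_all_wanted {
    my ($word) = @_;
    return ( all { $is_wanted{$_} } split //, $word ) ? 1 : 0;
}

for my $word (qw(ant bee elephant)) {
    printf "%-8s any=%d all=%d\n", $word, has_any_wanted($word), has_all_wanted($word);
}
```

One difference to be aware of: all returns true for an empty list, while the + quantifier in the regex requires at least one character, so the two disagree on the empty string.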
Demo1: only ONE character, so grep fails.
Demo2: the quantifier means one or more, but there is no anchor, so grep fails.
Demo3: anchors and quantifier, expected result.
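The three demos can be reproduced locally with a short script. This is only a sketch: the variable names @demo1 to @demo3 and the $unwanted string are chosen here for illustration, and the word list is the one from the question.

```perl
use warnings;
use strict;

my @test_data = qw(ant bee cat dodo elephant frog giraffe horse);
my $unwanted  = 'befghjklmpqrsuvwxyz';

# Demo 1: a single character outside the set, anywhere in the string.
my @demo1 = grep { /[^$unwanted]/ } @test_data;

# Demo 2: quantifier but no anchors; one outside character still suffices.
my @demo2 = grep { /[^$unwanted]+/ } @test_data;

# Demo 3: anchors plus quantifier; every character must be outside the set.
my @demo3 = grep { /^[^$unwanted]+$/ } @test_data;

print "demo1: @demo1\n";
print "demo2: @demo2\n";
print "demo3: @demo3\n";
```

The first two patterns keep every word except bee, which is the only word made up entirely of unwanted letters; only the anchored pattern narrows the list to ant, cat and dodo.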
Once you understand that a character class matches only ONE character, that anchors tie the match to the WHOLE string, and that a quantifier extends the match out to the anchors, you can grep directly with just the wanted letters:
my @wanted = grep(/^[@wanted_letters]+$/, @test_data);
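One detail worth knowing about the line above: interpolating an array into a regex joins its elements with the list separator $" (a space by default), so the class also contains a space. That is harmless for single words, but a possible variation is to build the class explicitly. The names $class and $all_wanted below are illustrative, not part of the original answer.

```perl
use warnings;
use strict;

my @test_data      = qw(ant bee cat dodo elephant frog giraffe horse);
my @wanted_letters = qw/a c d i n o t/;

# Join without a separator and escape each element, so the class holds
# exactly the wanted letters and no stray space.
my $class      = join '', map { quotemeta($_) } @wanted_letters;
my $all_wanted = qr/^[$class]+$/;

my @wanted = grep { $_ =~ $all_wanted } @test_data;
print "@wanted\n";

# The space only matters for input that contains spaces:
my $with_array = 'do it' =~ /^[@wanted_letters]+$/ ? 1 : 0;   # space is in the class
my $with_join  = 'do it' =~ $all_wanted            ? 1 : 0;   # space is not in the class
print "array interpolation: $with_array, joined class: $with_join\n";
```

Compiling the pattern once with qr// also keeps the grep block short when the same test is reused.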
You're matching something outside the character set anywhere in the string. But the string can still have characters from the set somewhere else. For instance, if the test word is elephant, the negated character class matches the a character.
If you want to test the whole string, you need to quantify it and anchor to the ends.
grep(/^[^befghjklmpqrsuvwxyz]*$/, @test_data);
Translated into English, it's the difference between "word contains no characters in the set" and "word contains a character not in the set".
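As a small sketch of that distinction, using the question's word list (the variable names are chosen here for illustration): the unanchored negated class answers "contains a character not in the set", while the negative grep and the anchored negated class both answer "contains no characters in the set".

```perl
use warnings;
use strict;

my @test_data = qw(ant bee cat dodo elephant frog giraffe horse);
my $unwanted  = 'befghjklmpqrsuvwxyz';

# "Word contains a character not in the set": true for elephant,
# because 'a' is outside the set.
my $has_outsider = 'elephant' =~ /[^$unwanted]/ ? 1 : 0;

# "Word contains no characters in the set": two spellings that agree
# for these words.
my @negated_match = grep { !/[$unwanted]/ } @test_data;
my @anchored      = grep { /^[^$unwanted]*$/ } @test_data;

print "elephant has an outside character: $has_outsider\n";
print "negated match: @negated_match\n";
print "anchored:      @anchored\n";
```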