I sometimes want to match whitespace but not newline.
So far I've been resorting to [ \t]. Is there a less awkward way?
According to regex101.com \s : Matches any space, tab or newline character.
If you're looking for a single space, that would be " " (one space). If you're looking for one or more, it's "  *" (two spaces and an asterisk) or " +" (one space and a plus).
Space, tab, line feed (newline), carriage return, form feed, and vertical tab characters are called "white-space characters" because they serve the same purpose as the spaces between words and lines on a printed page — they make reading easier.
Regex uses the backslash ( \ ) for two purposes: to introduce metacharacters such as \d (digit), \D (non-digit), \s (whitespace), \S (non-whitespace), \w (word), and \W (non-word); and to escape special regex characters, e.g., \. for . , \+ for + , \* for * , and \? for ? .
Use a double-negative:
/[^\S\r\n]/
That is, not any of: not-whitespace (the capital S complements), carriage return, or newline. Distributing the outer not (i.e., the complementing ^ in the character class) with De Morgan's law, this is equivalent to “whitespace but not carriage return or newline.” Including both \r and \n in the pattern correctly handles all of Unix (LF), classic Mac OS (CR), and DOS-ish (CR LF) newline conventions.
No need to take my word for it:
#! /usr/bin/env perl

use strict;
use warnings;

use 5.005;  # for qr//

my $ws_not_crlf = qr/[^\S\r\n]/;

for (' ', '\f', '\t', '\r', '\n') {
  my $qq = qq["$_"];
  printf "%-4s => %s\n", $qq,
    (eval $qq) =~ $ws_not_crlf ? "match" : "no match";
}
Output:
" "  => match
"\f" => match
"\t" => match
"\r" => no match
"\n" => no match
Note that \s, and therefore this pattern, excluded the vertical tab before Perl v5.18, which added it.
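The De Morgan equivalence can also be checked exhaustively rather than on a handful of samples. The following is only a sketch, restricted to ASCII code points on the assumption that no locale or Unicode rules are in play; the variable names are illustrative and not part of the original answer.

```perl
use strict;
use warnings;

# The double-negative class from the answer.
my $double_negative = qr/[^\S\r\n]/;

# Count ASCII code points where the class and the spelled-out logic
# ("whitespace, and not CR, and not LF") disagree.
my $disagreements = 0;
for my $code (0 .. 127) {
    my $ch        = chr $code;
    my $via_class = $ch =~ $double_negative ? 1 : 0;
    my $via_logic = ($ch =~ /\s/ && $ch ne "\r" && $ch ne "\n") ? 1 : 0;
    $disagreements++ if $via_class != $via_logic;
}

print "disagreements: $disagreements\n";
```

Because both sides rely on the same definition of \s, the count should come out as zero whether or not the running Perl treats vertical tab as whitespace.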
Before objecting too harshly, note that the Perl documentation uses the same technique. A footnote in the “Whitespace” section of perlrecharclass reads:
Prior to Perl v5.18, \s did not match the vertical tab. [^\S\cK] (obscurely) matches what \s traditionally did.
The same section of perlrecharclass also suggests other approaches that won’t offend language teachers’ opposition to double-negatives.
Outside locale and Unicode rules or when the /a switch is in effect, “\s matches [\t\n\f\r ] and, starting in Perl v5.18, the vertical tab, \cK.” Discard \r and \n to leave /[\t\f\cK ]/ for matching whitespace but not newline.
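As a small illustration of the explicit class in use, the sketch below collapses runs of non-newline whitespace while leaving line breaks alone. The helper name squeeze_blanks is hypothetical, invented for this example, and the sketch assumes plain ASCII input.

```perl
use strict;
use warnings;

# Explicit class from the text: tab, form feed, vertical tab (\cK), space.
my $ws_not_nl = qr/[\t\f\cK ]/;

# Hypothetical helper: replace each run of non-newline whitespace
# with a single space, keeping \r and \n untouched.
sub squeeze_blanks {
    my ($text) = @_;
    $text =~ s/$ws_not_nl+/ /g;
    return $text;
}

print squeeze_blanks("a \t b\nc\t\td\n");
```

With this input the two lines come back as "a b" and "c d", still separated by the original newlines.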
If your text is Unicode, use code similar to the sub below to construct a pattern from the table in the aforementioned documentation section.
sub ws_not_nl {
  local($_) = <<'EOTable';
0x0009 CHARACTER TABULATION       h s
0x000a LINE FEED (LF)              vs
0x000b LINE TABULATION             vs [1]
0x000c FORM FEED (FF)              vs
0x000d CARRIAGE RETURN (CR)        vs
0x0020 SPACE                      h s
0x0085 NEXT LINE (NEL)             vs [2]
0x00a0 NO-BREAK SPACE             h s [2]
0x1680 OGHAM SPACE MARK           h s
0x2000 EN QUAD                    h s
0x2001 EM QUAD                    h s
0x2002 EN SPACE                   h s
0x2003 EM SPACE                   h s
0x2004 THREE-PER-EM SPACE         h s
0x2005 FOUR-PER-EM SPACE          h s
0x2006 SIX-PER-EM SPACE           h s
0x2007 FIGURE SPACE               h s
0x2008 PUNCTUATION SPACE          h s
0x2009 THIN SPACE                 h s
0x200a HAIR SPACE                 h s
0x2028 LINE SEPARATOR              vs
0x2029 PARAGRAPH SEPARATOR         vs
0x202f NARROW NO-BREAK SPACE      h s
0x205f MEDIUM MATHEMATICAL SPACE  h s
0x3000 IDEOGRAPHIC SPACE          h s
EOTable

  my $class;
  # Capture the code point and the name, including any parenthesized
  # abbreviation such as (LF), so the newline test below can see it.
  while (/^0x([0-9a-f]{4})\s+([A-Z\s()-]+)/mg) {
    my($hex,$name) = ($1,$2);
    # Skip every character that acts as a line break.
    next if $name =~ /\b(?:CR|LF|NEL|SEPARATOR)\b/;
    $class .= "\\N{U+$hex}";
  }

  qr/[$class]/u;
}
The double-negative trick is also handy for matching alphabetic characters. Remember that \w matches “word characters”: alphabetic characters, digits, and underscore. We ugly-Americans sometimes want to write it as, say,
if (/[A-Za-z]+/) { ... }
but a double-negative character-class can respect the locale:
if (/[^\W\d_]+/) { ... }
Expressing “a word character but not digit or underscore” this way is a bit opaque. A POSIX character-class communicates the intent more directly:
if (/[[:alpha:]]+/) { ... }
or with a Unicode property as szbalint suggested
if (/\p{Letter}+/) { ... }
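To see how the four spellings differ in practice, here is a rough comparison sketch. It anchors each pattern so the whole string must consist of letters, and it uses Greek letters written as \x{...} escapes as a stand-in for non-ASCII text, which puts the string under Unicode rules rather than locale rules. The hash and the matching_styles helper are made up for this illustration.

```perl
use strict;
use warnings;

# The four ways of saying "letters only" discussed above.
my %letters_only = (
    ascii           => qr/\A[A-Za-z]+\z/,
    double_negative => qr/\A[^\W\d_]+\z/,
    posix           => qr/\A[[:alpha:]]+\z/,
    property        => qr/\A\p{Letter}+\z/,
);

# Hypothetical helper: names of the patterns that accept $text, sorted.
sub matching_styles {
    my ($text) = @_;
    my @hits = sort grep { $text =~ $letters_only{$_} } keys %letters_only;
    return @hits;
}

my %samples = (
    'ASCII word'  => "abc",
    'Greek word'  => "\x{3b1}\x{3b2}\x{3b3}",   # alpha beta gamma
    'with digits' => "abc_1",
);

for my $label (sort keys %samples) {
    my @hits = matching_styles($samples{$label});
    printf "%-11s => %s\n", $label, @hits ? join(", ", @hits) : "(none)";
}
```

Only the [A-Za-z] form rejects the Greek sample; the other three accept it, and all four reject the string containing an underscore and a digit.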