
Match whitespace but not newlines

Tags: regex, perl

I sometimes want to match whitespace but not newline.

So far I've been resorting to [ \t]. Is there a less awkward way?

JoelFan asked Aug 12 '10 15:08



1 Answer

Use a double-negative:

/[^\S\r\n]/ 

That is, not (non-whitespace or carriage return or newline); the capital S complements \s. Distributing the outer not (i.e., the complementing ^ in the character class) with De Morgan's law gives not-non-whitespace and not-carriage-return and not-newline, which is “whitespace but not carriage return or newline.” Including both \r and \n in the pattern correctly handles all of Unix (LF), classic Mac OS (CR), and DOS-ish (CR LF) newline conventions.

No need to take my word for it:

    #! /usr/bin/env perl

    use strict;
    use warnings;

    use 5.005;  # for qr//

    my $ws_not_crlf = qr/[^\S\r\n]/;

    for (' ', '\f', '\t', '\r', '\n') {
      my $qq = qq["$_"];
      printf "%-4s => %s\n", $qq,
        (eval $qq) =~ $ws_not_crlf ? "match" : "no match";
    }

Output:

    " "  => match
    "\f" => match
    "\t" => match
    "\r" => no match
    "\n" => no match

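Note that vertical tab is missing from the test above: before Perl v5.18, \s did not match it, so neither did this pattern. From v5.18 on, \s includes vertical tab and the pattern matches it too.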
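As a usage sketch (the sample string and the names $text and $squeezed are made up for illustration), the same class can squeeze runs of spaces, tabs, and form feeds down to a single space while leaving every line ending untouched:

```perl
use strict;
use warnings;

my $ws_not_crlf = qr/[^\S\r\n]/;

# Hypothetical sample: mixed tabs, spaces, and form feeds,
# with both LF and CR LF line endings.
my $text = "alpha \t beta\n\tgamma  \r\ndelta\f\fepsilon\n";

# Replace each run of non-newline whitespace with one space.
(my $squeezed = $text) =~ s/$ws_not_crlf+/ /g;

print $squeezed;
```

The \n and \r\n sequences survive unchanged because the class refuses to match either character, so no run can ever swallow a line ending.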

Before you object too harshly, note that the Perl documentation uses the same technique. A footnote in the “Whitespace” section of perlrecharclass reads:

Prior to Perl v5.18, \s did not match the vertical tab. [^\S\cK] (obscurely) matches what \s traditionally did.

The same section of perlrecharclass also suggests other approaches that won’t offend language teachers who object to double negatives.

Outside locale and Unicode rules or when the /a switch is in effect, “\s matches [\t\n\f\r ] and, starting in Perl v5.18, the vertical tab, \cK.” Discard \r and \n to leave /[\t\f\cK ]/ for matching whitespace but not newline.
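A quick way to convince yourself that the explicit class and the double negative agree is to compare them over the ASCII range. This is only a sketch, and it assumes Perl v5.18 or later so that \s already includes the vertical tab; the name @disagree is made up for the example.

```perl
use strict;
use warnings;
use 5.018;  # assumption: \s matches vertical tab only from v5.18 on

# Collect every ASCII code point on which the two patterns differ.
my @disagree = grep {
    my $c = chr;
    ($c =~ /[^\S\r\n]/a ? 1 : 0) != ($c =~ /[\t\f\cK ]/ ? 1 : 0);
} 0 .. 127;

printf "%d ASCII characters disagree\n", scalar @disagree;
```

On an older Perl the vertical tab (code point 11) would be expected to show up as the single disagreement, which is the difference the footnote describes.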

If your text is Unicode, use code similar to the sub below to construct a pattern from the table in the aforementioned documentation section.

    sub ws_not_nl {
      local($_) = <<'EOTable';
    0x0009        CHARACTER TABULATION   h s
    0x000a              LINE FEED (LF)    vs
    0x000b             LINE TABULATION    vs  [1]
    0x000c              FORM FEED (FF)    vs
    0x000d        CARRIAGE RETURN (CR)    vs
    0x0020                       SPACE   h s
    0x0085             NEXT LINE (NEL)    vs  [2]
    0x00a0              NO-BREAK SPACE   h s  [2]
    0x1680            OGHAM SPACE MARK   h s
    0x2000                     EN QUAD   h s
    0x2001                     EM QUAD   h s
    0x2002                    EN SPACE   h s
    0x2003                    EM SPACE   h s
    0x2004          THREE-PER-EM SPACE   h s
    0x2005           FOUR-PER-EM SPACE   h s
    0x2006            SIX-PER-EM SPACE   h s
    0x2007                FIGURE SPACE   h s
    0x2008           PUNCTUATION SPACE   h s
    0x2009                  THIN SPACE   h s
    0x200a                  HAIR SPACE   h s
    0x2028              LINE SEPARATOR    vs
    0x2029         PARAGRAPH SEPARATOR    vs
    0x202f       NARROW NO-BREAK SPACE   h s
    0x205f   MEDIUM MATHEMATICAL SPACE   h s
    0x3000           IDEOGRAPHIC SPACE   h s
    EOTable

      my $class;
      # Capture the name including parentheses and hyphens; otherwise the
      # match stops before "(LF)", "(CR)", and "(NEL)" and the filter below
      # never sees those abbreviations.
      while (/^0x([0-9a-f]{4})\s+([A-Z\s()-]+)/mg) {
        my($hex,$name) = ($1,$2);
        # Skip the line terminators; keep every other whitespace character.
        next if $name =~ /\b(?:LF|CR|NEL|SEPARATOR)\b/;
        $class .= "\\N{U+$hex}";
      }

      qr/[$class]/u;  # the /u modifier needs Perl v5.14 or later
    }
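If building the class from the table feels heavy, the double negative itself can be stretched to cover the Unicode line terminators. The following is a sketch rather than anything taken from perlrecharclass: it assumes Perl v5.14 or later for the /u modifier, and the names $ws_not_nl and is_ws_not_nl are hypothetical.

```perl
use strict;
use warnings;
use 5.014;  # assumption: needed for the /u modifier

# Whitespace, minus CR, LF, NEXT LINE, LINE SEPARATOR, PARAGRAPH SEPARATOR.
my $ws_not_nl = qr/[^\S\r\n\x{85}\x{2028}\x{2029}]/u;

# Hypothetical helper: true when the string contains such a character.
sub is_ws_not_nl { return $_[0] =~ $ws_not_nl ? 1 : 0 }

for my $cp (0x09, 0x20, 0xA0, 0x3000, 0x0A, 0x0D, 0x85, 0x2028, 0x2029) {
    printf "U+%04X => %s\n", $cp,
        is_ws_not_nl(chr $cp) ? "match" : "no match";
}
```

The first four code points (tab, space, no-break space, ideographic space) should match and the remaining five should not. The trade-off is that the excluded characters are listed by hand, so the pattern is only as complete as that list.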

Other Applications

The double-negative trick is also handy for matching alphabetic characters. Remember that \w matches “word characters”: alphabetic characters, digits, and underscore. We ugly-Americans sometimes want to write it as, say,

if (/[A-Za-z]+/) { ... } 

but a double-negative character-class can respect the locale:

if (/[^\W\d_]+/) { ... } 

Expressing “a word character but not digit or underscore” this way is a bit opaque. A POSIX character class communicates the intent more directly:

if (/[[:alpha:]]+/) { ... } 

or with a Unicode property as szbalint suggested

if (/\p{Letter}+/) { ... } 
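To see the difference in practice, here is a small comparison on a word with accented letters. It is a sketch under two assumptions: Unicode rules are requested with /u (Perl v5.14 or later) instead of a locale, and the sample word and the names @checks and %result are invented for the example.

```perl
use strict;
use warnings;
use 5.014;  # assumption: needed for the /u modifier

binmode STDOUT, ':encoding(UTF-8)';

my $word = "\x{e9}t\x{e9}";  # "été", written with escapes to avoid encoding issues

# Label/pattern pairs, kept in an array so the output order is fixed.
my @checks = (
    [ '[A-Za-z]'    => qr/^[A-Za-z]+$/     ],
    [ '[^\W\d_]'    => qr/^[^\W\d_]+$/u    ],
    [ '[[:alpha:]]' => qr/^[[:alpha:]]+$/u ],
    [ '\p{Letter}'  => qr/^\p{Letter}+$/   ],
);

my %result;
for my $check (@checks) {
    my ($label, $re) = @$check;
    $result{$label} = $word =~ $re ? "match" : "no match";
    printf "%-12s => %s\n", $label, $result{$label};
}
```

Only the ASCII-only class should report no match here. The other three are not guaranteed to be interchangeable on every input, since \w under Unicode rules covers more than letters, digits, and underscore.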
Greg Bacon answered Sep 21 '22 03:09