
R regular expressions: unexpected behavior of "[:digit:]"

Tags: regex, r

I'd like to extract elements beginning with digits from a character vector but there's something about POSIX regular expression syntax that I don't understand.

I would think that

vec <- c("012 foo", "305 bar", "other", "notIt 7")
grep(pattern="[:digit:]", x=vec)

would return 1 2 4, since those are the three elements that have digits somewhere in them. But in fact it returns 3 4.

Likewise, grep(pattern="^0", x=vec) returns 1, as I would expect, because element 1 starts with a zero. However, grep(pattern="^[:digit:]", x=vec) returns integer(0), whereas I would expect it to return 1 2, since those are the elements that start with digits.

How am I misunderstanding the syntax?

Drew Steen asked Jul 17 '12 15:07



2 Answers

Try

grep(pattern="[[:digit:]]", x=vec) 

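instead. POSIX character classes such as [:digit:] are only recognized inside a bracket expression, so they need the double brackets. Written with single brackets, "[:digit:]" is itself a bracket expression that matches any one of the characters :, d, i, g, or t, which is why it matched "other" and "notIt 7" (both contain a "t").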
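
To make the difference concrete, here is a minimal sketch using the vec from the question. The variable names (single, same_set, double, anchored) are only illustrative, and the single-bracket results assume R's default regex engine, as in the question.

```r
vec <- c("012 foo", "305 bar", "other", "notIt 7")

# Single brackets: a bracket expression over the literal characters : d i g t
single   <- grep("[:digit:]", vec)
# The same character set written without the repeated ":" and "i"
same_set <- grep("[:digt]", vec)

# Double brackets: the POSIX class [:digit:] inside a bracket expression
double   <- grep("[[:digit:]]", vec)    # elements containing a digit
anchored <- grep("^[[:digit:]]", vec)   # elements starting with a digit

print(single)    # 3 4, as reported in the question
print(double)    # 1 2 4
print(anchored)  # 1 2
```

The anchored form is the one that answers the original goal of picking out elements that begin with a digit.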

Dirk Eddelbuettel answered Oct 15 '22 00:10


Another solution is the \d shorthand for a digit, with the backslash doubled inside an R string:

grep(pattern="\\d", x=vec) 
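
As a rough sketch of how this could be applied to the original goal (elements beginning with a digit), the pattern can be anchored with ^, and value = TRUE returns the elements rather than their indices. The variable names below are hypothetical, chosen only for illustration.

```r
vec <- c("012 foo", "305 bar", "other", "notIt 7")

# "\\d" in R source code is the two-character regex \d (any digit)
starts_with_digit <- grep("^\\d", vec)                # indices of matching elements
digit_elements    <- grep("^\\d", vec, value = TRUE)  # the matching elements themselves

# grepl() gives a logical vector; [0-9] is an explicit alternative to \d
has_leading_digit <- grepl("^[0-9]", vec)

print(starts_with_digit)  # 1 2
print(digit_elements)     # "012 foo" "305 bar"
print(has_leading_digit)  # TRUE TRUE FALSE FALSE
```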

Wojciech Sobala answered Oct 14 '22 23:10