
R regex beginning of line ^ in data frame values

Tags: regex, r

Given:

    test <- data.frame(Speed=c("2 Mbps", "10 Mbps"))

Why does this regex match both values:

    grepl("[0-9]*Mbps$", test[,"Speed"], ignore.case=TRUE)

while this one matches neither of them:

    grepl("^[0-9]*Mbps$", test[,"Speed"], ignore.case=TRUE)

The ^ (beginning of line/string) character is causing the issue, but why?

Paulo S. Abreu asked May 26 '15 21:05


2 Answers

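The ^[0-9]*Mbps$ regex requires the string to start with zero or more digits, followed immediately by Mbps, which must also end the string. Your values have a space between the number and Mbps, so there is no match. To match these strings, allow optional whitespace between the two parts: ^[0-9]*\\s*Mbps$ (the backslash is doubled because the pattern is written as an R string literal).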

    test <- data.frame(Speed=c("2 Mbps", "10 Mbps"))
    grepl("^[0-9]*\\s*Mbps$", test[,"Speed"], ignore.case=TRUE)

Output of the demo program:

    [1] TRUE TRUE

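The unanchored [0-9]*Mbps$ matches only the Mbps at the end of each item: the pattern may start anywhere in the string, and because the * quantifier allows zero repetitions, [0-9]* matches an empty string right before Mbps.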
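
To see the empty-digit match concretely, here is a minimal sketch using base R's regexpr() and regmatches() to report which substring the unanchored pattern actually covers. The variable names speed, m and matched are illustrative only and do not come from the question.

```r
speed <- c("2 Mbps", "10 Mbps")

# Locate the first match of the unanchored pattern in each string.
m <- regexpr("[0-9]*Mbps$", speed, ignore.case = TRUE)

# Extract the matched text: only the trailing "Mbps", with no digits.
matched <- regmatches(speed, m)
print(matched)        # "Mbps" "Mbps"

# 1-based start position of each match: after the digits and the space.
print(as.integer(m))  # 3 4
```

The match starts at position 3 and 4 respectively, so the digits and the space are never part of it. That is why adding ^ changes the result.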

Wiktor Stribiżew answered Sep 19 '22 00:09


Because a space is missing in the regex.

Either "^[0-9]* Mbps$" or "^[0-9]*\\s*Mbps$" would match the inputs.


"[0-9]*Mbps$" matches (not necessarily from the beginning of the string) "zero or more digit characters, followed by 'Mbps' and end of string". For these inputs the digit part is empty, so only the trailing 'Mbps' is matched.

"^[0-9]*Mbps$" doesn't match the inputs, because it requires the input to start with zero-or-more digits, then 'Mbps' (no space!), then end of string.

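As a quick check of the patterns discussed above, this sketch runs all four against the same inputs and prints the results side by side. The names speed, patterns and results, as well as the labels given to each pattern, are made up for illustration.

```r
speed <- c("2 Mbps", "10 Mbps")

# The four patterns from the question and the answers, with illustrative labels.
patterns <- c(
  unanchored  = "[0-9]*Mbps$",
  anchored    = "^[0-9]*Mbps$",
  with_space  = "^[0-9]* Mbps$",
  with_spaces = "^[0-9]*\\s*Mbps$"
)

# Apply grepl() once per pattern; each column holds the result for one pattern.
results <- sapply(patterns, grepl, x = speed, ignore.case = TRUE)
rownames(results) <- speed
print(results)
```

Only the anchored column is FALSE for both rows: it is the one pattern that must account for every character from the start of the string but has nothing that can consume the space.
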
Alex Shesterov answered Sep 22 '22 00:09