
R regex weird behavior

Tags:

regex

r

I'm trying to get the location of a white space inside a string but I don't understand the results.

Given the string:

a = "12345,1300 miles"

> gregexpr("\\s", a)
[[1]]
[1] 11
attr(,"match.length")
[1] 1

This makes sense b/c the white space is in index 11 of the string.

> gregexpr("[\\s]", a)
[[1]]
[1] 16
attr(,"match.length")
[1] 1

This does not make sense to me b/c index 16 is simply the end of the string. There is no white space there, and I'm wondering why it skipped index 11.

I'm stumped, can anyone give an explanation on why this is happening?

> gregexpr("\\s*", a)
[[1]]
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
attr(,"match.length")
 [1] 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0

This also does not make sense to me b/c the white space matched every single character in the string.

Paolo asked Feb 18 '26 05:02


2 Answers

Inside character classes you should probably not use escaped regex sequences such as \s; they are not recognized there. With R's default regex engine, "[\\s]" is read as a class containing a backslash and the letter "s", which is why your second call matched the "s" at the end of "miles" (index 16). The ?regex page says: "Most metacharacters lose their special meaning inside a character class." I can successfully use [:space:] instead:

> grep("[\\s]", "ttt rrr a vvv")
integer(0)
> grep("[[:space:]]", "ttt rrr a vvv")
[1] 1
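
To make the difference concrete, here is a small sketch run against the string from the question. It assumes R's default regex engine for the first two calls; the perl = TRUE call is an extra option not used above, and the variable names are only illustrative.

```r
a <- "12345,1300 miles"

# Default engine: inside brackets, "\\s" is a literal backslash plus "s",
# so the only hit is the "s" that ends "miles".
default_hit <- as.integer(gregexpr("[\\s]", a)[[1]])
default_hit                      # 16
substr(a, default_hit, default_hit)  # "s"

# POSIX class inside brackets: matches the actual space.
posix_hit <- as.integer(gregexpr("[[:space:]]", a)[[1]])
posix_hit                        # 11

# Assumption: with the PCRE engine (perl = TRUE), \s keeps its meaning
# inside a character class, so the same pattern finds the space.
perl_hit <- as.integer(gregexpr("[\\s]", a, perl = TRUE)[[1]])
perl_hit                         # 11
```

If you want to keep writing \s inside brackets, perl = TRUE is probably the smallest change; otherwise [[:space:]] works with the default engine.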

In the "\\s*" case it is true that every one of those positions matches the pattern, because * also accepts zero occurrences. The behavior of this code is perhaps what you expected:

> gregexpr("\\s.*", a)
[[1]]
[1] 11
attr(,"match.length")
[1] 6
attr(,"useBytes")
[1] TRUE

Or:

> gregexpr("\\s+", a)
[[1]]
[1] 11
attr(,"match.length")
[1] 1
attr(,"useBytes")
[1] TRUE
IRTFM answered Feb 20 '26 23:02


I can explain the behaviour for the \s* case. The quantifier * matches 0 or more occurrences. Because 0 is allowed, the pattern also matches, with length 0, wherever it does not find a whitespace:

12345,1300 miles

Your regex \s* sees the first character, "1" ==> there is no \s, so it matches 0 occurrences, which means it MATCHES with length 0.

Then it goes on to the second character, "2" ==> there is no \s, so it matches 0 occurrences, which again means it MATCHES with length 0.

The same happens at the third character, and so on.

This regex does not match "every single character in the string"; it matches the empty string between those characters, plus the one real whitespace at index 11.
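
As a rough illustration of the empty matches, the sketch below takes the gregexpr result for "\\s*" on the string from the question and keeps only the matches that actually consumed a character. The names star and lens are made up for the example.

```r
a <- "12345,1300 miles"

# All matches of \s*, including the zero-length ones.
star <- gregexpr("\\s*", a)[[1]]
lens <- attr(star, "match.length")

# Keep only positions where at least one whitespace character was consumed.
real_hits <- as.integer(star)[lens > 0]
real_hits                        # 11

# Using + instead of * avoids the empty matches in the first place.
plus_hits <- as.integer(gregexpr("\\s+", a)[[1]])
plus_hits                        # 11
```

Filtering on match.length is only needed if you are stuck with a pattern that can match the empty string; switching to + is usually simpler.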

stema answered Feb 20 '26 21:02


