
R beginning match count

Tags:

regex

r

I am using R and have the following string:

s <- "\t\t\t   \t\t\thello    world   !  \t\t\thello"

I want to count the whitespace characters at the start of the string only, not anywhere else. The whitespace between the words should be ignored and only the leading run should be counted. The result here would be 9.

I have tried the following but it only returns a count of "1" ...

sapply(regmatches(s, gregexpr('^(\\s)', s)), length)

I am not very good at regex, so any help is appreciated.

chaz asked Jan 13 '15 05:01


2 Answers

For matching the first occurrence, regexpr() would be more appropriate than gregexpr(). As a result of that switch, sapply() will no longer be necessary because regexpr() returns an atomic vector whereas gregexpr() returns a list.

You could use the following regular expression, looking at the match.length attribute from the result of regexpr().

attr(regexpr("^\\s+", s), "match.length")
# [1] 9
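One thing to watch for: when a string has no leading whitespace, regexpr() reports a match.length of -1 rather than 0. Below is a minimal sketch of a wrapper that clamps that value. The name count_leading_ws is my own and not part of the answer above, and it assumes you want 0 for strings without leading whitespace.

```r
# Hypothetical helper (name is illustrative, not from the answer above):
# count the whitespace characters at the start of each string.
count_leading_ws <- function(x) {
  m <- regexpr("^\\s+", x)
  # match.length is -1 where nothing matched, so clamp it to 0
  pmax(attr(m, "match.length"), 0L)
}

s <- "\t\t\t   \t\t\thello    world   !  \t\t\thello"
count_leading_ws(c(s, "hello", "  hi"))
# [1] 9 0 2
```

Because regexpr() is vectorized over its input, the same call works on a whole character vector without sapply().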

Explanation of the regular expression:

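  • ^ Anchors the match to the beginning of the string.
  • \\s Matches a whitespace character: tab, newline, vertical tab, form feed, carriage return, or space. The regex itself is \s; the backslash is doubled because it sits inside an R string literal.
  • + Matches the preceding item one or more times, so the single match covers the whole leading run.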
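To see why the attempt in the question returned 1, it may help to compare the pattern with and without the + quantifier. This is only an illustrative sketch; the variable names one and run are my own.

```r
s <- "\t\t\t   \t\t\thello    world   !  \t\t\thello"

# Without "+", the anchored pattern consumes exactly one whitespace character,
# and "^" prevents any further match later in the string.
one <- regmatches(s, regexpr("^\\s", s))
nchar(one)
# [1] 1

# With "+", the single match extends over the whole leading run.
run <- regmatches(s, regexpr("^\\s+", s))
nchar(run)
# [1] 9
```

So the count comes from the length of one match, not from the number of matches.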

Reference: http://en.wikibooks.org/wiki/R_Programming/Text_Processing

Rich Scriven answered Oct 05 '22 23:10


Another way to solve this is to anchor with \G. The \G anchor matches at one of two positions: the beginning of the string, or the position where the previous match ended. Each whitespace character in the leading run is therefore its own match, and matching stops at the first non-whitespace character.

sapply(gregexpr("\\G\\s", s, perl = TRUE), length)
# [1] 9
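A caveat with counting via length(): gregexpr() signals "no match" with a single -1, so a string without leading whitespace would be counted as 1. Here is a hedged sketch that counts only real match positions. The helper name count_leading_ws_g is hypothetical, and it assumes 0 is the result you want in that case.

```r
# Hypothetical helper (name is illustrative): count \G-anchored
# whitespace matches at the start of each string.
count_leading_ws_g <- function(x) {
  m <- gregexpr("\\G\\s", x, perl = TRUE)
  # A failed search yields a single -1, so count only positive positions
  vapply(m, function(pos) sum(pos > 0L), integer(1))
}

s <- "\t\t\t   \t\t\thello    world   !  \t\t\thello"
count_leading_ws_g(c(s, "hello"))
# [1] 9 0
```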
hwnd answered Oct 05 '22 23:10