I am trying to get all matches for a RegExp from a string but apparently it's not so easy in R, or I have overlooked something. Truth be told, it's really confusing and I found myself lost among all the options: str_extract, str_match, str_match_all, regexec, grep, gregexpr, and who knows how many others.
In reality, all I'm trying to accomplish is simply (in Python):
>>> import re
>>> re.findall(r'([\w\']+|[.,;:?!])', 'This is starting to get really, really annoying!!')
['This', 'is', 'starting', 'to', 'get', 'really', ',', 'really', 'annoying', '!', '!']
The problem with the functions mentioned above is that they either return only one match or return no match at all.
In general, there is no exact R equivalent of Python's re.findall, which returns either a list of match values or a list of tuples holding the capturing group submatches. The closest is str_match_all from the stringr package, but it is also very close to Python's re.finditer: it returns the whole match value in the first item and then all submatches (capturing group contents) in the subsequent items. It is still not an exact equivalent of re.finditer, since only texts are returned, not match data objects. So, if the whole match value were not returned by str_match_all, it would be an exact equivalent of Python's re.findall.
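As a rough illustration of the point about str_match_all, the sketch below shows the whole match in the first column and the capturing groups in the following columns, and then drops the first column to get something closer to what re.findall returns for a pattern with groups. The input string and the key=value pattern are made up for this example; they are not from the question.

```r
library(stringr)

# Hypothetical input with two capturing groups per match
s <- "key1=val1; key2=val2"

# str_match_all returns a list with one character matrix per input string
m <- str_match_all(s, "(\\w+)=(\\w+)")[[1]]

# Column 1 holds the whole match, columns 2 and 3 hold the group contents
m

# Dropping column 1 leaves only the group contents, one row per match
groups_only <- m[, -1, drop = FALSE]
groups_only
```

Each row of groups_only plays the role of one tuple in the Python result.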
You are using re.findall to just return matches, not captures, so the capturing group in your pattern is redundant and you may remove it. Thus, you can safely use regmatches with gregexpr and the PCRE flavor (perl=TRUE), since [\\w'] won't work with the default TRE regex engine:
s <- "This is starting to get really, really annoying!!"
res <- regmatches(s, gregexpr("[\\w']+|[.,;:?!]", s, perl=TRUE))
## => [[1]]
## [1] "This" "is" "starting" "to" "get" "really"
## [7] "," "really" "annoying" "!" "!"
See the R demo
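If you need this often, the regmatches/gregexpr pair can be wrapped in a small helper. This is only a sketch: find_all is a hypothetical name, not a base R function, and it assumes a single input string.

```r
# Hypothetical helper (not part of base R): return all matches of `pattern`
# in a single string `text` as a character vector, using the PCRE engine
find_all <- function(pattern, text) {
  regmatches(text, gregexpr(pattern, text, perl = TRUE))[[1]]
}

hits <- find_all("[\\w']+|[.,;:?!]", "really, really annoying!!")
hits

# When nothing matches, the result is an empty character vector
none <- find_all("\\d+", "no digits here")
length(none)
```

The [[1]] is needed because regmatches returns a list with one element per input string.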
Or, to make \w Unicode-aware so that it works as in Python 3, add the (*UCP) PCRE verb:
res <- regmatches(s, gregexpr("(*UCP)[\\w']+|[.,;:?!]", s, perl=TRUE))
See another R demo
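To see what (*UCP) changes, you can compare both patterns on a string with non-ASCII letters. The sample string below is made up for this sketch, and the result without (*UCP) may depend on your platform and locale, so no output is shown for it.

```r
# Hypothetical sample text with accented letters, written with \u escapes
# so that the string is marked as UTF-8
s2 <- "caf\u00e9 na\u00efve"

# Without (*UCP), \w may be limited to ASCII letters, digits and underscore
without_ucp <- regmatches(s2, gregexpr("[\\w']+", s2, perl = TRUE))[[1]]

# With (*UCP), \w uses Unicode properties, so accented letters count as word characters
with_ucp <- regmatches(s2, gregexpr("(*UCP)[\\w']+", s2, perl = TRUE))[[1]]

without_ucp
with_ucp
```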
If you want to use the stringr package (which uses the ICU regex library behind the scenes), you need str_extract_all:
library(stringr)
res <- str_extract_all(s, "[\\w']+|[.,;:?!]")
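Note that str_extract_all is vectorised over its input and returns a list with one character vector per string. A small sketch with two made-up strings:

```r
library(stringr)

# Hypothetical inputs, one list element is returned for each of them
texts <- c("Hi there!", "No; really?")
res_list <- str_extract_all(texts, "[\\w']+|[.,;:?!]")

res_list[[1]]  # matches from the first string
res_list[[2]]  # matches from the second string
```

For a single input string, take the first element with [[1]] to get a plain character vector, which is the closest shape to the Python list.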