
R's equivalent of Python's re.findall

Tags:

python

regex

r

I am trying to get all matches for a RegExp from a string but apparently it's not so easy in R, or I have overlooked something. Truth be told, it's really confusing and I found myself lost among all the options: str_extract, str_match, str_match_all, regexec, grep, gregexpr, and who knows how many others.

In reality, all I'm trying to accomplish is simply (in Python):

>>> import re
>>> re.findall(r'([\w\']+|[.,;:?!])', 'This is starting to get really, really annoying!!')
['This', 'is', 'starting', 'to', 'get', 'really', ',', 'really', 'annoying', '!', '!']

The problem with the functions mentioned above is that they either return only one match or return no match at all.

rubik asked Apr 13 '17 20:04


1 Answer

In general, R has no exact equivalent of Python's re.findall, which returns either a list of match values or, when the pattern contains capturing groups, a list of the captured submatches (tuples when there is more than one group). The closest is str_match_all from the stringr package, but it behaves more like Python's re.finditer: it returns the whole match value in the first item and the submatches (capturing group contents) in the subsequent items. It is still not an exact equivalent of re.finditer, because only texts are returned, not match data objects. If str_match_all did not return the whole match value, it would be an exact equivalent of Python's re.findall.
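To make the comparison concrete, here is a minimal Python sketch of how re.findall changes its return shape with the number of capturing groups, and how re.finditer can be used to build rows of "whole match, then submatches" similar to what str_match_all returns. The second pattern and its sample string are made up for illustration only.

```python
import re

text = "This is starting to get really, really annoying!!"

# No capturing group: findall returns the whole matches.
no_group = re.findall(r"[\w']+|[.,;:?!]", text)

# One capturing group around the whole pattern: findall returns the group
# contents, which here are identical to the whole matches.
one_group = re.findall(r"([\w']+|[.,;:?!])", text)

# Two capturing groups (illustrative pattern): findall returns tuples of
# submatches and drops the whole match.
sample = "get really annoying"
pairs = re.findall(r"(\w)(\w+)", sample)

# finditer yields match objects; building (whole match, group 1, group 2)
# rows by hand gives a shape close to a str_match_all result.
rows = [(m.group(0),) + m.groups() for m in re.finditer(r"(\w)(\w+)", sample)]

print(no_group == one_group)  # True
print(pairs)  # [('g', 'et'), ('r', 'eally'), ('a', 'nnoying')]
print(rows)   # [('get', 'g', 'et'), ('really', 'r', 'eally'), ('annoying', 'a', 'nnoying')]
```

Because the single group in the question's pattern wraps the entire expression, dropping it does not change the result.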

Since you are using re.findall only to return matches, not captures, the capturing group in your pattern is redundant and you may remove it. You can then safely use regmatches with gregexpr and the PCRE flavor (perl=TRUE), since [\\w'] won't work with a TRE regex:

s <- "This is starting to get really, really annoying!!"
res <- regmatches(s, gregexpr("[\\w']+|[.,;:?!]", s, perl=TRUE))
## => [[1]]
## [1] "This"     "is"       "starting" "to"       "get"      "really"
## [7] ","        "really"   "annoying" "!"        "!"


Or, to make \w Unicode-aware so that it works as in Python 3, add the (*UCP) PCRE verb:

res <- regmatches(s, gregexpr("(*UCP)[\\w']+|[.,;:?!]", s, perl=TRUE))

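For reference, this is the Python 3 behaviour that (*UCP) is meant to reproduce. For str patterns, \w is Unicode-aware by default, and the re.ASCII flag restricts it to [a-zA-Z0-9_], which is roughly what PCRE does without (*UCP). The sample string is an assumption chosen only to contain non-ASCII letters.

```python
import re

# Illustrative input with two non-ASCII letters (\u017c is "ż", \u00e9 is "é").
sample = "Stribi\u017cew's caf\u00e9"

# Default in Python 3: \w also matches non-ASCII letters.
unicode_tokens = re.findall(r"[\w']+", sample)

# With re.ASCII, \w only matches [a-zA-Z0-9_], so words are split
# at every non-ASCII letter.
ascii_tokens = re.findall(r"[\w']+", sample, flags=re.ASCII)

print(unicode_tokens)  # ["Stribiżew's", 'café']
print(ascii_tokens)    # ['Stribi', "ew's", 'caf']
```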

If you want to use the stringr package (which uses the ICU regex library behind the scenes), you need str_extract_all:

library(stringr)
res <- str_extract_all(s, "[\\w']+|[.,;:?!]")
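Note that both regmatches/gregexpr and str_extract_all are vectorised: they return a list with one character vector per input string, which is why the result sits under [[1]]. A rough Python analogue, assuming a hypothetical helper named findall_each, is a list comprehension over the inputs:

```python
import re

# Same pattern as in the question, without the redundant capturing group.
PATTERN = re.compile(r"[\w']+|[.,;:?!]")


def findall_each(strings):
    """Return one list of matches per input string (hypothetical helper)."""
    return [PATTERN.findall(s) for s in strings]


result = findall_each(["really, really annoying!!", "OK?"])
print(result)
# [['really', ',', 'really', 'annoying', '!', '!'], ['OK', '?']]
```

In R, res[[1]] extracts the matches for the first (here, the only) input string.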
Wiktor Stribiżew answered Sep 22 '22 19:09