How can I extract these multiple regex groups in R

I have string inputs in the following format:

my.strings <- c("FACT11", "FACT11:FACT20", "FACT1sometext:FACT20", "FACT1text with spaces:FACT20", "FACT14:FACT20", "FACT1textAnd1312:FACT2etc", "FACT12:FACT22:FACT31")

I would like to extract every "FACT" together with the first digit that follows it. So the result from this example would be:

c("FACT1", "FACT1 FACT2", "FACT1 FACT2", "FACT1 FACT2", "FACT1 FACT2", "FACT1 FACT2", "FACT1 FACT2 FACT3")

Alternatively, the result could be a list, where each element of the list is a vector with 1 up to 3 items.

What I got so far is:

gsub("(FACT[1-3]).*?:(FACT[1-3]).*", '\\1 \\2', my.strings)
# [1] "FACT11"      "FACT1 FACT2" "FACT1 FACT2" "FACT1 FACT2" "FACT1 FACT2" "FACT1 FACT2"
# [7] "FACT1 FACT2"

This is close, except that the first element is "FACT11" instead of "FACT1" (the second "1" should be dropped), and "FACT3" is missing from the last element of my.strings. But adding a third group to gsub breaks the whole thing: strings without two colons no longer match and come back unchanged.

gsub("(FACT[1-3]).*?:(FACT[1-3]).*?:(FACT[1-3]).*?", '\\1 \\2 \\3', my.strings)
# [1] "FACT11"                       "FACT11:FACT20"                "FACT1sometext:FACT20"        
# [4] "FACT1text with spaces:FACT20" "FACT14:FACT20"                "FACT1textAnd1312:FACT2etc"   
# [7] "FACT1 FACT2 FACT31"

So how can I properly extract the groups?

bobbel asked Jan 03 '23 00:01



1 Answer

You can do this with a base R approach:

> m <- regmatches(my.strings, gregexpr("FACT[1-3]", my.strings))
> sapply(m, paste, collapse=" ")
[1] "FACT1"            
[2] "FACT1 FACT2"      
[3] "FACT1 FACT2"      
[4] "FACT1 FACT2"      
[5] "FACT1 FACT2"      
[6] "FACT1 FACT2"      
[7] "FACT1 FACT2 FACT3"

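gregexpr finds every match of your FACT[1-3] pattern (FACT[0-9] or FACT\\d would also work if any digit may follow), regmatches extracts those matches as a list with one vector per input string, and paste then "joins" each vector with a space.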
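
If you would rather have the list form mentioned in the question, you can stop before the paste step. Here is a minimal sketch of both forms. It assumes the wider FACT[0-9] pattern (the question only uses digits 1 to 3), and the names facts and joined are just illustrative:

```r
my.strings <- c("FACT11", "FACT11:FACT20", "FACT1sometext:FACT20",
                "FACT1text with spaces:FACT20", "FACT14:FACT20",
                "FACT1textAnd1312:FACT2etc", "FACT12:FACT22:FACT31")

# gregexpr() locates every match in each string; regmatches() pulls them out
# as a list with one character vector per input string.
# FACT[0-9] is an assumption here; use FACT[1-3] to restrict the digits.
facts <- regmatches(my.strings, gregexpr("FACT[0-9]", my.strings))

# List form: element i holds the matches for my.strings[i].
facts[[7]]
# [1] "FACT1" "FACT2" "FACT3"

# Collapsed form: one space-separated string per input.
joined <- vapply(facts, paste, character(1), collapse = " ")
joined[c(1, 7)]
# [1] "FACT1"             "FACT1 FACT2 FACT3"
```

vapply is used instead of sapply only so that the result is guaranteed to be a character vector; sapply gives the same values here.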

Wiktor Stribiżew answered Jan 11 '23 10:01
