I am trying to isolate words from a string in R using gsub. I want to extract a last name that sits between the closing ")" of the first name and either "(m)" (for males) or "(f)" (for females). I am struggling to handle both cases in one line of code.
name<-c("Dr. T. (Tom) Bailey (m), UCL- Physics" , "Dr. B.K. (Barbara) Blue (f), Oxford - Political Science")
malename<-gsub(".*\\) (.*) \\(m).*", "\\1", name)
femname<-gsub(".*\\) (.*) \\(f).*", "\\1", name)
The code above gives me the names for males and females separately, but ideally I want to obtain all the last names in one variable. This would involve some OR function (so (m) OR (f)), but I don't know how to incorporate this.
If you need to match either m or f, the best way to match them is a character class (or, in POSIX terminology, a bracket expression): [mf].
Your regex will look like
".*\\)\\s+(.*)\\s+\\([mf]\\).*"
                     ^^^^
You may use the regex with sub to make sure only one regex match and replacement are performed:
name<-c("Dr. T. (Tom) Bailey (m), UCL- Physics" , "Dr. B.K. (Barbara) Blue (f), Oxford - Political Science")
res <- sub(".*\\)\\s+(.*)\\s+\\([mf]\\).*", "\\1", name)
res
## => [1] "Bailey" "Blue"
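Since the question asks for an OR, here is a minimal sketch comparing the character class with explicit alternation, (m|f), which is the more general form when the alternatives are longer than one character. The variable names res_class, res_alt and gender are illustrative only; the extra capture group does not change what \\1 refers to, and it makes the marker itself available as \\2.

```r
name <- c("Dr. T. (Tom) Bailey (m), UCL- Physics",
          "Dr. B.K. (Barbara) Blue (f), Oxford - Political Science")

# Character class: a single character that is either m or f
res_class <- sub(".*\\)\\s+(.*)\\s+\\([mf]\\).*", "\\1", name)

# Alternation: group 1 is still the last name, group 2 holds the m/f marker
res_alt <- sub(".*\\)\\s+(.*)\\s+\\((m|f)\\).*", "\\1", name)

# Reuse the same pattern to pull out the marker instead of the name
gender <- sub(".*\\)\\s+(.*)\\s+\\((m|f)\\).*", "\\2", name)

res_class  # "Bailey" "Blue"
res_alt    # "Bailey" "Blue"
gender     # "m" "f"
```

Keep in mind that sub returns an element unchanged when the pattern does not match it, so rows without an (m) or (f) marker would come back as the full original string.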
Try with sub
sub("^[^)]+\\)\\s+(\\w+).*", "\\1", name)
#[1] "Bailey" "Blue"
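The two patterns are not interchangeable on every input. As a sketch, assuming a hypothetical record with a multi-word last name (the entry below is invented for illustration and is not from the question), (\\w+) stops at the first non-word character, while (.*) runs up to the (m)/(f) marker:

```r
# Hypothetical input, not from the original question
x <- "Prof. J. (Jan) van der Berg (m), Leiden - History"

# Captures only the first word after the first closing parenthesis
first_word <- sub("^[^)]+\\)\\s+(\\w+).*", "\\1", x)

# Captures everything between the closing parenthesis and the (m)/(f) marker
full_name <- sub(".*\\)\\s+(.*)\\s+\\([mf]\\).*", "\\1", x)

first_word  # "van"
full_name   # "van der Berg"
```

Which one is preferable depends on the data: the \\w+ version does not rely on the (m)/(f) marker being present, whereas the [mf] version keeps multi-word names intact.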