I have a vector containing some names. I want to extract the title from every element, basically everything between the ", " (including the white space) and the ".".
> head(combi$Name)
[1] "Braund, Mr. Owen Harris"
[2] "Cumings, Mrs. John Bradley (Florence Briggs Thayer)"
[3] "Heikkinen, Miss. Laina"
[4] "Futrelle, Mrs. Jacques Heath (Lily May Peel)"
[5] "Allen, Mr. William Henry"
[6] "Moran, Mr. James"
I suppose gsub might be useful, but I am having difficulty finding the right regular expression for this.
1) sub With sub, capture the text between ", " and the next "." and replace the whole string with that capture:
> sub(".*, ([^.]*)\\..*", "\\1", Name)
[1] "Mr" "Mrs" "Miss" "Mrs" "Mr" "Mr"
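For anyone who wants to run this, here is a minimal self-contained sketch of approach 1. It assumes Name is simply a character vector holding the values from the question (for example Name <- combi$Name); the sample values below are copied from the question's output.

```r
# Assumption: Name stands in for combi$Name from the question.
Name <- c("Braund, Mr. Owen Harris",
          "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
          "Heikkinen, Miss. Laina",
          "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
          "Allen, Mr. William Henry",
          "Moran, Mr. James")

# Pattern pieces:
#   ".*, "     everything up to and including the ", " after the surname
#   "([^.]*)"  capture group 1: a run of characters that are not "."
#   "\\..*"    the literal "." and the rest of the string
# The whole string is matched and replaced by group 1 ("\\1").
titles <- sub(".*, ([^.]*)\\..*", "\\1", Name)
print(titles)
```

Because the pattern matches the entire string, a single sub call is enough here; there is nothing left over to replace.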
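1a) sub variation This approach with gsub also works. It deletes both the part up to ", " and the part from the first "." onward, so it needs gsub rather than sub (sub would only remove the first of the two matches):
> gsub(".*, |\\..*", "", Name)
[1] "Mr" "Mrs" "Miss" "Mrs" "Mr" "Mr"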
2) strapplyc Or, using strapplyc in the gsubfn package, it can be done with a simpler regular expression:
> library(gsubfn)
>
> strapplyc(Name, ", ([^.]*)\\.", simplify = TRUE)
[1] "Mr" "Mrs" "Miss" "Mrs" "Mr" "Mr"
2a) strapplyc variation This one seems to have the simplest regular expression of them all.
> library(gsubfn)
>
> sapply(strapplyc(Name, "\\w+"), "[", 2)
[1] "Mr" "Mrs" "Miss" "Mrs" "Mr" "Mr"
3) strsplit A third way is to use strsplit:
> sapply(strsplit(Name, ", |\\."), "[", 2)
[1] "Mr" "Mrs" "Miss" "Mrs" "Mr" "Mr"
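To see why the second piece is the title, it can help to look at what strsplit returns for a single name. A small sketch, using one value from the question:

```r
# Split on either ", " or a literal "."
parts <- strsplit("Braund, Mr. Owen Harris", ", |\\.")[[1]]
print(parts)

# Piece 1 is the surname, piece 2 is the title, the rest follows.
title <- parts[2]
print(title)
```

Note that the third piece keeps its leading space, since only the "." itself is consumed by the split.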
Added additional solutions. Changed gsub to sub (although gsub works too).
Not that there's anything lacking in G. Grothendieck's answer. I just want to add a solution using sub and non-greedy repetition:
vec <- c("Moran, Mr. James",
"Rothschild, Mrs. Martin (Elizabeth L. Barrett)")
sub(".*, (.+?)\\..*", "\\1", vec)
# [1] "Mr" "Mrs"
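The second name is in the example because it contains a later "." (in "L."), which is where greedy and non-greedy repetition differ. A minimal sketch of the contrast; the greedy pattern is added here only for comparison and is not one of the suggested solutions:

```r
vec <- c("Moran, Mr. James",
         "Rothschild, Mrs. Martin (Elizabeth L. Barrett)")

# Greedy: (.+) runs up to the LAST "." in the string.
greedy <- sub(".*, (.+)\\..*", "\\1", vec)

# Non-greedy: (.+?) stops at the FIRST "." after ", ".
lazy <- sub(".*, (.+?)\\..*", "\\1", vec)

print(greedy)
print(lazy)
```

For names with only one ".", both patterns give the same result; they only diverge when another "." appears later in the string.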
Another alternative with regexpr, regmatches, and lookbehind/lookahead:
regmatches(vec, regexpr("(?<=, ).+?(?=\\.)", vec, perl = TRUE))
# [1] "Mr" "Mrs"
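One thing to watch with this approach: regmatches returns only the matches, so elements without a match are dropped and the result can be shorter than the input. A sketch of one way to keep positions aligned, assuming NA is an acceptable placeholder (the second input value is made up for illustration):

```r
x <- c("Moran, Mr. James", "No title here")  # second value is hypothetical

# regexpr returns -1 for elements where the pattern does not match.
m <- regexpr("(?<=, ).+?(?=\\.)", x, perl = TRUE)

# Start with NA everywhere, then fill in the positions that matched.
titles <- rep(NA_character_, length(x))
titles[m != -1] <- regmatches(x, m)
print(titles)
```

This matters mostly when the result is assigned back to a data frame column, where the lengths have to agree.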