Extracting everything between two symbols in a string

Tags: regex, r, gsub

I have a vector containing some names. I want to extract the title from each element: everything between the ", " (the comma plus the following white space) and the next ".".

> head(combi$Name)
[1] "Braund, Mr. Owen Harris"
[2] "Cumings, Mrs. John Bradley (Florence Briggs Thayer)"
[3] "Heikkinen, Miss. Laina"
[4] "Futrelle, Mrs. Jacques Heath (Lily May Peel)"
[5] "Allen, Mr. William Henry"
[6] "Moran, Mr. James"

I suppose gsub might be useful, but I am having difficulty finding the right regular expression for this.

Gianluca asked Feb 16 '14 15:02



2 Answers

1) sub. With sub and a capture group:

> sub(".*, ([^.]*)\\..*", "\\1", Name)
[1] "Mr"   "Mrs"  "Miss" "Mrs"  "Mr"   "Mr"  
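
For reference, the pattern in (1) can be read piece by piece. The sketch below is self-contained: it rebuilds Name from the six rows shown in the question, which is an assumption, since the answer does not show how Name was created (presumably it is combi$Name).

```r
# Sample data copied from the question; in the real data this would
# presumably be combi$Name (assumption).
Name <- c("Braund, Mr. Owen Harris",
          "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
          "Heikkinen, Miss. Laina",
          "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
          "Allen, Mr. William Henry",
          "Moran, Mr. James")

# ".*, "     everything up to and including the comma and space
# "([^.]*)"  capture group 1: a run of characters that are not a dot
# "\\."      the literal dot that ends the title
# ".*"       the rest of the string
# The whole string is matched and replaced by group 1 ("\\1").
titles <- sub(".*, ([^.]*)\\..*", "\\1", Name)
print(titles)
```

Because the pattern spans the whole string, there is at most one match per element, so sub and gsub behave the same here. An element that does not match the pattern is returned unchanged.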

1a) gsub variation. This approach needs gsub rather than sub, because two separate parts of each string are removed:

> gsub(".*, |\\..*", "", Name)
[1] "Mr"   "Mrs"  "Miss" "Mrs"  "Mr"   "Mr"  
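
The alternation pattern deletes two separate pieces of each string (the part up to ", " and the part from the first remaining dot onward), so every match has to be replaced, not just the first. A minimal sketch of the difference, using one row from the question (the variable names are only illustrative):

```r
x <- "Braund, Mr. Owen Harris"
pattern <- ".*, |\\..*"

# sub() replaces only the first match (the leading "Braund, ")
first_only <- sub(pattern, "", x)

# gsub() replaces every match, so the trailing ". Owen Harris" goes too
all_matches <- gsub(pattern, "", x)

print(first_only)
print(all_matches)
```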

2) strapplyc. Alternatively, using strapplyc from the gsubfn package, it can be done with a simpler regular expression:

> library(gsubfn)
>
> strapplyc(Name, ", ([^.]*)\\.", simplify = TRUE)
[1] "Mr"   "Mrs"  "Miss" "Mrs"  "Mr"   "Mr"  

2a) strapplyc variation. This one seems to have the simplest regular expression of them all:

> library(gsubfn)
>
> sapply(strapplyc(Name, "\\w+"), "[", 2)
[1] "Mr"   "Mrs"  "Miss" "Mrs"  "Mr"   "Mr"  

3) strsplit. A third way is to use strsplit:

> sapply(strsplit(Name, ", |\\."), "[", 2)
[1] "Mr"   "Mrs"  "Miss" "Mrs"  "Mr"   "Mr"  
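
All of these agree on the six sample rows, but they make different assumptions about the shape of a name. The sketch below compares approaches (1), (2a) and (3) on two hypothetical names that are not in the question: one with a two-word surname and one with a two-word title. To stay in base R, (2a) is imitated with regmatches and gregexpr instead of strapplyc.

```r
# Hypothetical names (not from the question): a two-word surname and a
# two-word title.
tricky <- c("Vander Planke, Mrs. Julius",
            "Rothes, the Countess. of (Lucy Noel)")

# (1) capture everything between ", " and the first following dot
by_capture <- sub(".*, ([^.]*)\\..*", "\\1", tricky)

# (2a) second run of word characters; base-R stand-in for strapplyc
words <- regmatches(tricky, gregexpr("\\w+", tricky))
by_second_word <- sapply(words, "[", 2)

# (3) split on ", " or "." and take the second piece
by_split <- sapply(strsplit(tricky, ", |\\."), "[", 2)

print(by_capture)
print(by_second_word)
print(by_split)
```

On these inputs (1) and (3) return the full title, while the second-word approach returns part of the surname or only the first word of the title, so it is best kept for data where every surname and title is a single word.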

Added additional solutions. Changed gsub to sub in (1) (although gsub works there too).

G. Grothendieck answered Oct 31 '22 01:10


Not that there's anything lacking in G. Grothendieck's answer; I just want to add a solution using sub and non-greedy repetition:

vec <- c("Moran, Mr. James",
         "Rothschild, Mrs. Martin (Elizabeth L. Barrett)")

sub(".*, (.+?)\\..*", "\\1", vec)
# [1] "Mr"  "Mrs"

Another alternative with regexpr, regmatches, and lookbehind/lookahead:

regmatches(vec, regexpr("(?<=, ).+?(?=\\.)", vec, perl = TRUE))
# [1] "Mr"  "Mrs"
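
One thing to keep in mind with the regmatches approach: when used with regexpr, it returns only the substrings that matched, so elements without a match are dropped and the result can be shorter than the input. A minimal sketch of one way to keep the lengths aligned, assuming an NA is wanted for names without a title (the second element below is a made-up example):

```r
vec <- c("Moran, Mr. James",
         "No title here",   # hypothetical element without ", " or "."
         "Rothschild, Mrs. Martin (Elizabeth L. Barrett)")

# regexpr() returns -1 for elements where the pattern does not match
m <- regexpr("(?<=, ).+?(?=\\.)", vec, perl = TRUE)

# Start with NA everywhere, then fill in only the matched positions
titles <- rep(NA_character_, length(vec))
titles[m != -1] <- regmatches(vec, m)

print(titles)
```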

Sven Hohenstein answered Oct 31 '22 02:10