R - using regex, set position before nth punct in string and delete what follows

Question

I have a large data frame with a column of string data that contains names and, in some rows, an email address. I'd like a regular expression that finds the position just before the second comma in the rows that include an email address and removes everything from there on, so that I am left with an "author" column of just names, no emails included.

> author<-c("Doe, Jane", "Smith, John", "Doe, John, johndoe@xyz.net", "Smith, Jane")
> ID<- c(1:4)   
> df<-cbind(author, ID)

> df

  author                         ID 
[1,] Doe, Jane                   1
[2,] Smith, John                 2
[3,] Doe, John, johndoe@xyz.net  3
[4,] Smith, Jane                 4

I'd like the output to look as follows:

> df

author                            ID 
[1,] Doe, Jane                    1
[2,] Smith, John                  2
[3,] Doe, John                    3
[4,] Smith, Jane                  4

Avinash Raj · Accepted Answer

Use the sub function. The pattern [^,]* matches any character other than a comma, zero or more times.

> author<-c("Doe, Jane", "Smith, John", "Doe, John, johndoe@xyz.net", "Smith, Jane")
> sub("^([^,]*,[^,]*),.*", "\\1", author)
[1] "Doe, Jane"   "Smith, John" "Doe, John"   "Smith, Jane"
> ID<- c(1:4)
> df<-cbind(author=sub("^([^,]*,[^,]*),.*", "\\1", author), ID)
> df
     author        ID 
[1,] "Doe, Jane"   "1"
[2,] "Smith, John" "2"
[3,] "Doe, John"   "3"
[4,] "Smith, Jane" "4"
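
The question mentions a data frame, whereas cbind() on a character vector and an integer vector builds a character matrix (which is why ID appears quoted above). As a minimal sketch, assuming the data really lives in a data.frame with the column names from the question, the same substitution can be applied to the column directly. The stringsAsFactors = FALSE argument is an addition here so that the column stays character on older versions of R.

```r
author <- c("Doe, Jane", "Smith, John", "Doe, John, johndoe@xyz.net", "Smith, Jane")

# Assumed layout: a data.frame rather than the matrix produced by cbind()
df <- data.frame(author = author, ID = 1:4, stringsAsFactors = FALSE)

# Keep everything before the second comma.
# Rows without a second comma do not match and are returned unchanged.
df$author <- sub("^([^,]*,[^,]*),.*", "\\1", df$author)
df
```

With this layout ID stays an integer column instead of being coerced to character.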

Explanation:

^ asserts that we are at the start of the string.
([^,]*,[^,]*) The parentheses (...) form a capturing group, which captures the characters matched by the pattern inside it. Here that pattern is [^,]*,[^,]*. As mentioned above, [^,]* matches any character other than a comma, zero or more times, so [^,]*,[^,]* matches everything from the start of the string up to, but not including, the second comma. The group stores those characters under group index 1, and we can refer to them later by that index number. This is called back-referencing.
,.* matches the second comma and the zero or more characters that follow it.
sub replaces the first match with the string given in the replacement argument (gsub replaces every match). Because the pattern is anchored at the start and ends in .*, it matches the whole string, so the whole string is replaced by the characters in group 1. Strings that have no second comma do not match and are returned unchanged. Inside an R string literal the backslash itself must be escaped, which is why the back-reference \1 is written as "\\1" in the replacement.
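
The title asks about the nth punctuation mark, while the pattern above is fixed at the second comma. One possible generalisation, offered as a sketch rather than as part of the accepted answer, is to build the pattern with a repetition count. The helper name keep_before_nth_comma is hypothetical, and perl = TRUE is assumed so that the non-capturing group (?:...) is available.

```r
# Hypothetical helper (not from the answer): keep the text before the n-th comma.
keep_before_nth_comma <- function(x, n) {
  stopifnot(n >= 1)
  # (?:[^,]*,){n-1} consumes the first n-1 comma-terminated fields,
  # [^,]* consumes the n-th field, and ,.* matches the n-th comma and the rest.
  pattern <- sprintf("^((?:[^,]*,){%d}[^,]*),.*", as.integer(n - 1))
  sub(pattern, "\\1", x, perl = TRUE)
}

author <- c("Doe, Jane", "Smith, John", "Doe, John, johndoe@xyz.net", "Smith, Jane")
keep_before_nth_comma(author, 2)     # same result as the fixed pattern above
keep_before_nth_comma("a,b,c,d", 3)  # "a,b,c"
```

As with the original pattern, a string containing fewer than n commas does not match and is returned as is.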

Tags: string, regex, r, punctuation, gsub

B Victor
