I have a large data frame with a string column that contains names and, in some rows, an email address as well. I'd like a regular expression that finds the position just before the second comma in the rows that include an email address, so that I can remove everything after it and be left with an "author" column of names only, with no emails.
> author<-c("Doe, Jane", "Smith, John", "Doe, John, [email protected]", "Smith, Jane")
> ID<- c(1:4)
> df<-cbind(author, ID)
> df
author ID
[1,] Doe, Jane 1
[2,] Smith, John 2
[3,] Doe, John, [email protected] 3
[4,] Smith, Jane 4
I'd like the output to look as follows:
> df
author ID
[1,] Doe, Jane 1
[2,] Smith, John 2
[3,] Doe, John 3
[4,] Smith, Jane 4
Use the sub function. The pattern [^,]* matches any character other than a comma, zero or more times.
> author<-c("Doe, Jane", "Smith, John", "Doe, John, [email protected]", "Smith, Jane")
> sub("^([^,]*,[^,]*),.*", "\\1", author)
[1] "Doe, Jane" "Smith, John" "Doe, John" "Smith, Jane"
> ID<- c(1:4)
> df<-cbind(author=sub("^([^,]*,[^,]*),.*", "\\1", author), ID)
> df
author ID
[1,] "Doe, Jane" "1"
[2,] "Smith, John" "2"
[3,] "Doe, John" "3"
[4,] "Smith, Jane" "4"
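As a quick check of how the same pattern treats other shapes of input, here is a minimal sketch. The vector x and its values are made up for illustration. Strings with fewer than two commas do not match the pattern, so sub returns them unchanged; strings with more than two commas are cut at the second one.

```r
# Hypothetical inputs: no comma, one comma, and three commas.
x <- c("Doe", "Doe, Jane", "Doe, John, a, b")

# The pattern needs a second comma to match; without one, sub() returns
# the input string as it is.
result <- sub("^([^,]*,[^,]*),.*", "\\1", x)
result
# [1] "Doe"       "Doe, Jane" "Doe, John"
```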
Explanation:
^ asserts that we are at the start of the string.
(...) is a capturing group. It captures the characters matched by the pattern inside it. In our case, the pattern inside the capturing group is [^,]*,[^,]*. As mentioned above, [^,]* matches any character other than a comma, zero or more times, so [^,]*,[^,]* matches everything from the start of the string up to, but not including, the second comma.
([^,]*,[^,]*) stores those matched characters in group 1. We can refer to the characters held by a capturing group by specifying its index number. This is called back-referencing.
,.* matches the second comma and the zero or more characters that follow it.
sub replaces the first match with the string given in the replacement part (gsub replaces every match). Here the match runs from the start of the string through everything after the second comma, and it is replaced by the characters in group 1. That's why we used \\1 in the replacement part.
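Note that cbind on a character vector and a numeric vector returns a character matrix, not a data frame. If the data really lives in a data frame, the same substitution can be assigned back to the column. This is a sketch under assumptions: the address jdoe@example.com is a placeholder standing in for the redacted one, and the column names follow the question.

```r
# Hypothetical data; "jdoe@example.com" is a placeholder address.
author <- c("Doe, Jane", "Smith, John", "Doe, John, jdoe@example.com", "Smith, Jane")
df <- data.frame(author = author, ID = 1:4, stringsAsFactors = FALSE)

# Keep everything before the second comma; rows without a second comma
# do not match and are left unchanged.
df$author <- sub("^([^,]*,[^,]*),.*", "\\1", df$author)
df
```

Because the column is overwritten in place, ID stays an integer column instead of being coerced to character as it is with cbind.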