
R - using regex, set position before nth punct in string and delete what follows

I have a large data frame with a string column that contains names and, in some rows, an email address as well. I'd like a regular expression that finds the position just before the second comma in the rows that include an email address and removes everything after it, so that I am left with an "author" column containing only names, with no emails.

> author<-c("Doe, Jane", "Smith, John", "Doe, John, [email protected]", "Smith, Jane")
> ID<- c(1:4)   
> df<-cbind(author, ID)

> df

  author                         ID 
[1,] Doe, Jane                   1
[2,] Smith, John                 2
[3,] Doe, John, [email protected]  3
[4,] Smith, Jane                 4

I'd like the output to look as follows

> df

  author                         ID 
[1,] Doe, Jane                    1
[2,] Smith, John                  2
[3,] Doe, John                    3
[4,] Smith, Jane                  4
B Victor asked Mar 17 '23 04:03


1 Answer

Use the sub function. [^,]* matches zero or more characters that are not a comma.

> author<-c("Doe, Jane", "Smith, John", "Doe, John, [email protected]", "Smith, Jane")
> sub("^([^,]*,[^,]*),.*", "\\1", author)
[1] "Doe, Jane"   "Smith, John" "Doe, John"   "Smith, Jane"
> ID<- c(1:4)
> df<-cbind(author=sub("^([^,]*,[^,]*),.*", "\\1", author), ID)
> df
     author        ID 
[1,] "Doe, Jane"   "1"
[2,] "Smith, John" "2"
[3,] "Doe, John"   "3"
[4,] "Smith, Jane" "4"
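
Note that cbind on a character vector and an integer vector returns a character matrix rather than a data frame, which is why ID prints as "1", "2", and so on above. A minimal sketch of the same substitution applied to an actual data frame column might look like this; the email address is a placeholder chosen for illustration, not the one from the question.

```r
# Build a real data frame; the email address is a hypothetical placeholder.
df <- data.frame(
  author = c("Doe, Jane", "Smith, John", "Doe, John, jdoe@example.com", "Smith, Jane"),
  ID = 1:4,
  stringsAsFactors = FALSE
)

# Keep everything before the second comma.
# Rows without a second comma do not match and are left unchanged.
df$author <- sub("^([^,]*,[^,]*),.*", "\\1", df$author)

print(df)
```

With a data frame, ID keeps its integer type and only the author column is rewritten.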

Explanation:

  • ^ asserts that we are at the start of the string.
  • ([^,]*,[^,]*) The parentheses (...) form a capturing group, which captures the characters matched by the pattern inside it. Here that pattern is [^,]*,[^,]*. As noted above, [^,]* matches zero or more characters other than a comma, so [^,]*,[^,]* matches everything from the start of the string up to, but not including, the second comma. The group stores those characters under group index 1, and we can refer to them later by that index number. This is called back-referencing.
  • ,.* matches the second comma and everything that follows it (zero or more characters).
  • sub replaces the first match with the replacement string, while gsub replaces every match. Because the pattern is anchored with ^ and ends in .*, it matches the whole string whenever a second comma is present, so the whole string is replaced by the contents of group 1. That is why \\1 is used as the replacement. Strings without a second comma do not match and are returned unchanged.
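
The behaviour described in the bullets can be checked piece by piece. The sketch below is only an illustration: the input strings, including the placeholder email address, and the variable names are assumptions made for the example.

```r
pattern <- "^([^,]*,[^,]*),.*"

# No second comma: the pattern cannot match, so the input comes back unchanged.
no_match <- sub(pattern, "\\1", "Doe, Jane")

# One extra field: the whole string matches and is replaced by group 1.
one_extra <- sub(pattern, "\\1", "Doe, John, jdoe@example.com")

# Several extra fields: .* consumes everything after the second comma.
many_extra <- sub(pattern, "\\1", "a, b, c, d")

# grepl() reports which strings contain a second comma at all.
has_second_comma <- grepl("^[^,]*,[^,]*,", c("Doe, Jane", "a, b, c, d"))

# sub() replaces only the first match, gsub() replaces every match.
first_only <- sub(",", ";", "a,b,c")
all_matches <- gsub(",", ";", "a,b,c")

print(c(no_match, one_extra, many_extra, first_only, all_matches))
print(has_second_comma)
```

Since the pattern is anchored at the start and ends in .*, it can match at most once per string, so sub and gsub give the same result for this particular pattern.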
Avinash Raj answered Mar 20 '23 03:03