R gsub to extract emails from text

Tags: regex, r, gsub

I have a variable a, created by calling readLines on a file that contains some emails. I have already filtered it down to the rows containing the @ symbol, and now I am struggling to grab the emails. The text in my variable looks like this:

> dput(a[1:5])
c("buenas tardes. excelente. por favor a: [email protected]", 
"[email protected] ", "Aprecio tu aporte , mi correo es [email protected] , Muchas Gracias", 
"gracias [email protected]", "Me apunto, muchas gracias mi dirección [email protected] me será de mucha utilidad. "
)

From this question on SO I got a starting point for extracting the emails (@Aaron Haurun's answer). Slightly modified (I added a [\w.] before the @ to handle emails with a . between names), it worked well on regex101.com. However, it fails when I port it to gsub:

> gsub("()(\\w[\\w.]+@[\\w.-]+|\\{(?:\\w+, *)+\\w+\\}@[\\w.-]+)()", 
       "\\2", 
       a[1:5], 
       perl = FALSE) ## It doesn't matter if I use perl = TRUE

[1] "buenas tardes. excelente. por favor a: [email protected]"           "[email protected] "                                                                          
[3] "Aprecio tu aporte , mi correo es [email protected] , Muchas Gracias"                           "gracias [email protected]"                                                                       
[5] "Me apunto, muchas gracias mi dirección [email protected] me será de mucha utilidad. "

What am I doing wrong and how can I grab those emails? Thanks!

PavoDive asked Dec 24 '22 05:12


2 Answers

We can try str_extract() from the stringr package:

library(stringr)
str_extract(a, "\\S*@\\S*")

[1] "[email protected]"              
[2] "[email protected]"             
[3] "[email protected]"             
[4] "[email protected]"      
[5] "[email protected]"

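Here \\S* matches any run of non-whitespace characters (including none), so the pattern captures everything attached to the @, up to the nearest whitespace on either side.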
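If adding a package is not an option, the same idea can be sketched in base R with regexpr() and regmatches(). This is only an illustration: the sample lines and addresses below are made-up placeholders, not the asker's data, and \\S+ is used instead of \\S* so that a lone @ is not returned.

```r
# Hypothetical sample lines (placeholder addresses, assumed for illustration).
a <- c("por favor a: ana.perez@example.com",
       "juan@example.org ",
       "mi correo es maria@example.net , Muchas Gracias")

# regexpr() locates the first match in each string;
# regmatches() then extracts the matched substrings.
emails <- regmatches(a, regexpr("\\S+@\\S+", a, perl = TRUE))
print(emails)
```

Note that regmatches() silently drops elements with no match, so the result can be shorter than the input when some lines contain no address.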

Psidom answered Jan 02 '23 16:01


Building on the pattern from the answer you linked in your question:

library(stringr)
str_extract(a, '\\S+@\\S+|\\{(?:\\w+, *)+\\w+\\}@[\\w.-]+')
#[1] "[email protected]"               "[email protected]"              "[email protected]"              "[email protected]"      
#[5] "[email protected]"
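
As for why the gsub call in the question appears to do nothing: gsub only replaces the part of the string that the pattern matches. There the pattern matches just the email, and the replacement \\2 is that same email, so every string comes back unchanged. A minimal sketch of one way around this, assuming made-up placeholder addresses and a simplified version of the question's pattern, is to let the pattern consume the text around the address as well:

```r
# Hypothetical sample lines (placeholder addresses, assumed for illustration).
a <- c("por favor a: ana.perez@example.com",
       "juan@example.org ",
       "mi correo es maria@example.net , Muchas Gracias")

# The match is replaced by itself, so the strings are returned as they were.
unchanged <- gsub("(\\w[\\w.]+@[\\w.-]+)", "\\1", a, perl = TRUE)

# Match the whole line: a lazy prefix, the address in group 1, then the rest.
# Only the captured address survives the replacement.
emails <- sub("^.*?(\\w[\\w.]+@[\\w.-]+).*$", "\\1", a, perl = TRUE)
print(emails)
```

Unlike str_extract(), sub() returns the input untouched when nothing matches, so lines without an address would need to be filtered out beforehand.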

Sotos answered Jan 02 '23 18:01