Extract e-mail addresses from strings using R

Tags: string, regex, r, perl

These are 5 Twitter user descriptions. The idea is to extract the e-mail address from each string.

This is the code I've tried. It works, but there is probably something better. I'd rather avoid using unlist() and do it in one go with a regex. I've seen similar questions for Python/Perl/PHP, but not for R. I know I could use grep(..., perl = TRUE), but that shouldn't be the only way to do it. If it works, of course, it helps.

ds <- c(
    "#MillonMusical | #PromotorMusical | #Diseñador | Contacto :        [email protected] | #Instagram : Ezeqielgram | 01-11-11 |           @_MillonMusical @flowfestar",
    "LipGLosSTudio by: SAndry RUbio           Maquilladora PRofesional estudiande de diseño profesional de maquillaje     artistico [email protected]/",
    "Medico General Barranquillero   radicado con su familia en Buenos Aires para iniciar Especialidad       Medico Quirurgica. email [email protected]",
    "msn =     [email protected] = ronaldotorres-br",
    "Aguante piscis /       [email protected]  buenos aires"
    )

ds <- unlist(strsplit(ds, ' '))
ds <- ds[grep("mail.", ds)]

> print(ds)
[1] "\t\[email protected]"  "[email protected]/"
[3] "[email protected]"       "[email protected]"
[5] "/\t\[email protected]"

It would be nice to isolate the address in cases like this one, "[email protected]", perhaps by requiring the match to end in .com or .com.ar, which would make sense for what I'm working on.

asked Nov 10 '13 23:11 by marbel


1 Answer

Here's one alternative:

> regmatches(ds, regexpr("[[:alnum:]]+\\@[[:alpha:]]+\\.com", ds))
[1] "[email protected]"     "[email protected]" "[email protected]"      "[email protected]"    
[5] "[email protected]" 
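
Because the addresses on this page are masked, here is a minimal, self-contained sketch of the same call on made-up data. The vector `ds_demo` and every address in it are hypothetical and are not taken from the question. It also shows a behaviour worth knowing: regexpr() reports only the first match in each string, and regmatches() silently skips strings that have no match, so the result can be shorter than the input.

```r
# Hypothetical sample data: these addresses are invented for illustration.
ds_demo <- c(
  "Contacto : musica@example.com | #Instagram : somebody",
  "maquillaje artistico studio@example.com/",
  "no address in this description"
)

# regexpr() locates the first match in each string (-1 when there is none).
m <- regexpr("[[:alnum:]]+\\@[[:alpha:]]+\\.com", ds_demo)

# regmatches() returns the matched substrings and drops non-matching strings.
emails <- regmatches(ds_demo, m)
print(emails)
```

Here the trailing "/" in the second string is left out of the match, and the third string contributes nothing to the result.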

Following @Frank's comment, if you also want to keep a country identifier after .com (such as .com.ar in your example), make that suffix optional in the pattern:

> ds <- c(ds, "[email protected]")  # a new e-mail address
> regmatches(ds, regexpr("[[:alnum:]]+\\@[[:alpha:]]+\\.com(\\.[a-z]{2})?", ds))
[1] "[email protected]"      "[email protected]"  "[email protected]"       "[email protected]"     
[5] "[email protected]"      "[email protected]"
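
A possible extension, offered as a sketch rather than a tested solution for real Twitter data: the pattern above stops at the first character that is not alphanumeric, so a local part containing a dot, underscore, or hyphen would be cut short, and regexpr() returns at most one address per string. The version below assumes a slightly wider character class and uses gregexpr() to collect every match. The names `ds_demo`, `pattern`, and `found`, and all addresses, are hypothetical.

```r
# Hypothetical sample data; none of these addresses come from the question.
ds_demo <- c(
  "email first.last@example.com",
  "msn = user_01@example.com.ar = someone-br",
  "two here: a@example.com and b@example.com.ar",
  "no address"
)

# Assumed pattern: allows ., _ and - in the local part, - in the domain,
# and an optional two-letter country suffix after .com.
pattern <- "[[:alnum:]._-]+@[[:alnum:]-]+\\.com(\\.[a-z]{2})?"

# gregexpr() finds every match per string; regmatches() then returns a list
# with one character vector per input element (empty when nothing matched).
found <- regmatches(ds_demo, gregexpr(pattern, ds_demo))

# Flatten only if a single vector of addresses is wanted.
emails <- unlist(found)
print(emails)
```

Keeping `found` as a list preserves the link between each description and its addresses; flattening it loses that link but gives one plain character vector.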

answered Sep 23 '22 13:09 by Jilber Urbina