These are five Twitter user descriptions. The idea is to extract the e-mail address from each string.
This is the code I've tried. It works, but there is probably something better: I'd rather avoid unlist() and do it in one go with a regex. I've seen similar questions for Python/Perl/PHP but not for R. I know I could use grep(..., perl = TRUE), but that shouldn't be the only way to do it. Of course, if it works, it helps.
ds <- c("#MillonMusical | #PromotorMusical | #Diseñador | Contacto : [email protected] | #Instagram : Ezeqielgram | 01-11-11 | @_MillonMusical @flowfestar", "LipGLosSTudio by: SAndry RUbio Maquilladora PRofesional estudiande de diseño profesional de maquillaje artistico [email protected]/", "Medico General Barranquillero radicado con su familia en Buenos Aires para iniciar Especialidad Medico Quirurgica. email [email protected]", "msn =
[email protected] = ronaldotorres-br", "Aguante piscis / [email protected] buenos aires"
)
ds <- unlist(strsplit(ds, ' '))
ds <- ds[grep("mail.", ds)]
> print(ds)
[1] "\t\[email protected]" "[email protected]/"
[3] "[email protected]" "[email protected]"
[5] "/\t\[email protected]"
It would be nice to separate this one, "[email protected]", perhaps by requiring the match to end in .com or .com.ar; that would make sense for what I'm working on.
Here's one alternative:
> regmatches(ds, regexpr("[[:alnum:]]+\\@[[:alpha:]]+\\.com", ds))
[1] "[email protected]" "[email protected]" "[email protected]" "[email protected]"
[5] "[email protected]"
Based on @Frank's comment, if you want to keep the country identifier after .com, as in your example .com.ar, then look at this:
> ds <- c(ds, "[email protected]") # a new e-mail address
> regmatches(ds, regexpr("[[:alnum:]]+\\@[[:alpha:]]+\\.com(\\.[a-z]{2})?", ds))
[1] "[email protected]" "[email protected]" "[email protected]" "[email protected]"
[5] "[email protected]" "[email protected]"
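Two caveats with the calls above: regexpr() returns only the first match per string, and regmatches() silently drops strings with no match, so the result can be shorter than the input. Also, [[:alnum:]]+ stops at dots or underscores in the part before the @. Here is a rough sketch of one way around that, using gregexpr() and a slightly wider character class. The sample strings and addresses are made up for illustration (they are not the ones from the question), and the pattern is still only a heuristic for .com / .com.xx addresses, not a full e-mail validator.

```r
# Hypothetical sample data: these addresses are invented for illustration.
ds2 <- c(
  "Contacto : ana.perez@example.com | #Instagram : someone",
  "maquillaje artistico studio01@example.com.ar/",
  "no address in this one",
  "msn = first@example.com = second_user@example.com.ar"
)

# Local part: letters, digits, dot, underscore, hyphen.
# Domain: letters, digits, dot, hyphen, then ".com" and an optional
# two-letter country code such as ".ar".
pattern <- "[[:alnum:]._-]+@[[:alnum:].-]+\\.com(\\.[a-z]{2})?"

# gregexpr() finds every match in each string; regmatches() then returns
# a list with one character vector per input string (empty if no match).
all_matches <- regmatches(ds2, gregexpr(pattern, ds2))

# Keep the result aligned with the input: first address, or NA if none.
first_match <- vapply(
  all_matches,
  function(m) if (length(m) > 0) m[1] else NA_character_,
  character(1),
  USE.NAMES = FALSE
)

print(first_match)
# [1] "ana.perez@example.com"   "studio01@example.com.ar" NA
# [4] "first@example.com"
```

all_matches[[4]] holds both addresses from the last string, so you can decide per case whether to keep the first one or all of them.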