I'm trying to get the ending of email addresses (e.g. .net, .com, .edu, etc.), but the portion after the @ can contain multiple periods.
library(stringi)
strings1 <- c(
'test@aol.com',
'test@hotmail.com',
'test@xyz.rr.edu',
'test@abc.xx.zz.net'
)
list1 <- stri_split_fixed(strings1, "@", 2)
df1 <- data.frame(do.call(rbind,list1))
> list2 <- stri_split_fixed(df1$X2, '.(?!.*.)', 2);list2
[[1]]
[1] "aol.com"
[[2]]
[1] "hotmail.com"
[[3]]
[1] "xyz.rr.edu"
[[4]]
[1] "abc.xx.zz.net"
Any suggestions to get something like this:
X1 X2 X3
1 test aol.com com
2 test hotmail.com com
3 test xyz.rr.edu edu
4 test abc.xx.zz.net net
EDIT: Another attempt:
> list2 <- stri_split_fixed(df1$X2, '\.(?!.*\.)\w+', 2);list2
Error: '\.' is an unrecognized escape in character string starting "'\."
Here are a few approaches. The first seems particularly straightforward and the second particularly short.
1) sub This can be done with one application of sub in base R per column:
data.frame(X1 = sub("@.*", "", strings1),
X2 = sub(".*@", "", strings1),
X3 = sub(".*[.]", "", strings1),
stringsAsFactors = FALSE)
giving:
X1 X2 X3
1 test aol.com com
2 test hotmail.com com
3 test xyz.rr.edu edu
4 test abc.xx.zz.net net
2) strapplyc Here is an alternative using the gsubfn package that is particularly short. It returns a character matrix. strapplyc returns the matches to the portions of the pattern in parentheses. The first set of parentheses matches everything before the @, the second set matches everything after the @, and the last set matches everything after the last dot.
library(gsubfn)
pat <- "(.*)@(.*[.](.*))"
t(strapplyc(strings1, pat, simplify = TRUE))
[,1] [,2] [,3]
[1,] "test" "aol.com" "com"
[2,] "test" "hotmail.com" "com"
[3,] "test" "xyz.rr.edu" "edu"
[4,] "test" "abc.xx.zz.net" "net"
2a) read.pattern read.pattern, also in the gsubfn package, can do it using the same pat defined in (2):
library(gsubfn)
pat <- "(.*)@(.*[.](.*))"
read.pattern(text = strings1, pat, as.is = TRUE)
giving a data.frame similar to (1) except that the column names are V1, V2 and V3.
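If adding gsubfn as a dependency is not an option, the same pat can probably be applied with base R alone. The following is a minimal sketch, assuming the strings1 vector from the question; regexec and regmatches are base R functions, and the column names X1, X2 and X3 are chosen here only to match the desired output.

```r
strings1 <- c("test@aol.com", "test@hotmail.com",
              "test@xyz.rr.edu", "test@abc.xx.zz.net")
pat <- "(.*)@(.*[.](.*))"

# regexec() records the full match plus each parenthesised group;
# regmatches() extracts them as one character vector per input string.
m <- regmatches(strings1, regexec(pat, strings1))

# Stack the vectors into a matrix and drop column 1 (the full match).
parts <- do.call(rbind, m)[, -1, drop = FALSE]
colnames(parts) <- c("X1", "X2", "X3")
res <- as.data.frame(parts, stringsAsFactors = FALSE)
res
```

This sketch assumes every address matches pat. A string without an @ or without a dot after it yields an empty match, and the rbind step would then need extra handling.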
3) strsplit The overlapping extractions make it difficult to do with strsplit, but we can do it with two applications of strsplit. The first strsplit splits at the @ and the second splits on everything up to and including the last dot. This second strsplit always produces an empty string as the first piece, which we delete using [, -1]. This gives a character matrix:
ss <- function(x, pat) do.call(rbind, strsplit(x, pat))
cbind( ss(strings1, "@"), ss(strings1, ".*[.]")[, -1] )
giving the same answer as (2).
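To see why the [, -1] is needed, it may help to look at the intermediate result of the second split. This is a small sketch, assuming the strings1 vector from the question; the names pieces and tld are introduced here only for illustration.

```r
strings1 <- c("test@aol.com", "test@hotmail.com",
              "test@xyz.rr.edu", "test@abc.xx.zz.net")

# Splitting on ".*[.]" consumes everything up to and including the last dot,
# so each result starts with an empty string followed by the TLD.
pieces <- strsplit(strings1, ".*[.]")
pieces[[3]]   # "" "edu"

# Binding the pieces row-wise and dropping column 1 leaves only the TLDs.
tld <- do.call(rbind, pieces)[, -1]
tld
```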
4) strsplit/sub This is a mix of (1) and (3):
cbind(do.call(rbind, strsplit(strings1, "@")), sub(".*[.]", "", strings1))
giving the same answer as (2).
4a) This is another way to use strsplit and sub. Here we append an @ followed by the TLD and then split on @.
do.call(rbind, strsplit(sub("(.*[.](.*))", "\\1@\\2", strings1), "@"))
giving the same answer as (2).
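The two steps of (4a) can be inspected separately. This is a minimal sketch, assuming the strings1 vector from the question; tagged and res are names introduced here for illustration.

```r
strings1 <- c("test@aol.com", "test@hotmail.com",
              "test@xyz.rr.edu", "test@abc.xx.zz.net")

# "\\1" is the whole address and "\\2" is the text after the last dot,
# so the replacement appends "@<TLD>" to the end of each address.
tagged <- sub("(.*[.](.*))", "\\1@\\2", strings1)
tagged[3]   # "test@xyz.rr.edu@edu"

# Splitting on "@" now yields three pieces per address.
res <- do.call(rbind, strsplit(tagged, "@"))
res
```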
Update Added additional solutions.
A read.table + file_ext approach (not regex but pretty easy):
dat <- read.table(text=strings1, sep="@")
dat$V3 <- tools::file_ext(strings1)
dat
## V1 V2 V3
## 1 test aol.com com
## 2 test hotmail.com com
## 3 test xyz.rr.edu edu
## 4 test abc.xx.zz.net net
Here's a purely regex approach:
do.call(rbind, strsplit(strings1, "@|\\.(?=[^\\.]+$)", perl=TRUE))
## [,1] [,2] [,3]
## [1,] "test" "aol" "com"
## [2,] "test" "hotmail" "com"
## [3,] "test" "xyz.rr" "edu"
## [4,] "test" "abc.xx.zz" "net"
So this is a negative lookahead regex that should give you the last .word of that line:
\.(?!.*\.)\w+
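To use this pattern from R, the backslashes have to be doubled inside the string literal (a single '\.' is what triggers the "unrecognized escape" error), and the lookahead needs a PCRE-style engine, which base R provides via perl = TRUE. Here is a minimal sketch with base R's regexpr and regmatches, assuming the strings1 vector from the question; hits and tld are names introduced here for illustration.

```r
strings1 <- c("test@aol.com", "test@hotmail.com",
              "test@xyz.rr.edu", "test@abc.xx.zz.net")

# Doubled backslashes: R's parser turns "\\." into the regex token \.
pat <- "\\.(?!.*\\.)\\w+"

# perl = TRUE enables lookahead support; the match includes the leading dot.
hits <- regmatches(strings1, regexpr(pat, strings1, perl = TRUE))
hits

# Strip the leading dot to keep only the TLD.
tld <- sub("^[.]", "", hits)
tld
```

Note that regmatches combined with regexpr silently drops strings that have no match, so the result can be shorter than the input if some address lacks a dot.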