R regex find last occurrence of delimiter

Question

I'm trying to get the ending of email addresses (i.e. .net, .com, .edu, etc.), but the portion after the @ can contain multiple periods.

library(stringi)

strings1 <- c(
    'test@aol.com',
    'test@hotmail.com',
    'test@xyz.rr.edu',
    'test@abc.xx.zz.net'
)

list1 <- stri_split_fixed(strings1, "@", 2)
df1 <- data.frame(do.call(rbind,list1))

    > list2 <- stri_split_fixed(df1$X2, '.(?!.*.)', 2);list2
[[1]]
[1] "aol.com"

[[2]]
[1] "hotmail.com"

[[3]]
[1] "xyz.rr.edu"

[[4]]
[1] "abc.xx.zz.net"

Any suggestions to get something like this:

    X1            X2  X3
1 test       aol.com com
2 test   hotmail.com com
3 test    xyz.rr.edu edu
4 test abc.xx.zz.net net

EDIT: Another attempt:

> list2 <- stri_split_fixed(df1$X2, '\.(?!.*\.)\w+', 2);list2
Error: '\.' is an unrecognized escape in character string starting "'\."

G. Grothendieck · Accepted Answer

Here are a few approaches. The first seems particularly straightforward and the second particularly short.

1) sub This can be done with one application of sub in R per column:

data.frame(X1 = sub("@.*", "", strings1), 
           X2 = sub(".*@", "", strings1), 
           X3 = sub(".*[.]", "", strings1), 
           stringsAsFactors = FALSE)

giving:

    X1            X2  X3
1 test       aol.com com
2 test   hotmail.com com
3 test    xyz.rr.edu edu
4 test abc.xx.zz.net net
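
The X3 column works because .* is greedy: it consumes as much as it can, so the [.] that follows it lands on the last dot in the string. A minimal sketch of the difference, using a single made-up address and a lazy .*? for contrast (perl = TRUE is set here only so that the lazy quantifier is handled by PCRE; the variable names are illustrative):

```r
x <- "test@abc.xx.zz.net"

# Greedy: .* runs up to the last dot, so only the TLD is left.
greedy <- sub(".*[.]", "", x)

# Lazy: .*? stops at the first dot, so everything after it is left.
lazy <- sub(".*?[.]", "", x, perl = TRUE)

print(greedy)
print(lazy)
```

The greedy form leaves "net", while the lazy form leaves "xx.zz.net".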

2) strapplyc Here is an alternative using the gsubfn package that is particularly short. It returns a character matrix. strapplyc returns the matches to the portions of the pattern in parentheses. The first set of parentheses matches everything before the @, the second set matches everything after the @, and the last set matches everything after the last dot.

library(gsubfn)
pat <- "(.*)@(.*[.](.*))"
t(strapplyc(strings1, pat, simplify = TRUE))

     [,1]   [,2]            [,3] 
[1,] "test" "aol.com"       "com"
[2,] "test" "hotmail.com"   "com"
[3,] "test" "xyz.rr.edu"    "edu"
[4,] "test" "abc.xx.zz.net" "net"
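
If installing gsubfn is not an option, the same capture-group idea can be sketched in base R with regexec and regmatches. This is an alternative to strapplyc rather than what it does internally, and the names m and res are only illustrative:

```r
strings1 <- c("test@aol.com", "test@hotmail.com",
              "test@xyz.rr.edu", "test@abc.xx.zz.net")
pat <- "(.*)@(.*[.](.*))"

# regexec() records the full match plus each parenthesised group;
# regmatches() returns them as one character vector per input string.
m <- regmatches(strings1, regexec(pat, strings1))

# Drop the first element (the full match) and stack the groups row-wise.
res <- do.call(rbind, lapply(m, function(g) g[-1]))
print(res)
```

The result is a 4 x 3 character matrix with the same layout as the strapplyc output.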

2a) read.pattern read.pattern, also in the gsubfn package, can do it using the same pat defined in (2):

library(gsubfn)
pat <- "(.*)@(.*[.](.*))"
read.pattern(text = strings1, pat, as.is = TRUE)

giving a data.frame similar to (1) except the column names are V1, V2 and V3.

3) strsplit The overlapping extractions make it difficult to do with strsplit but we can do it with two applications of strsplit. The first strsplit splits at the @ and the second uses everything up to the last dot to split on. This last strsplit always produces an empty string as the first split string and we delete this using [, -1]. This gives a character matrix:

 ss <- function(x, pat) do.call(rbind, strsplit(x, pat))
 cbind( ss(strings1, "@"), ss(strings1, ".*[.]")[, -1] )

giving the same answer as (2).
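
To see why the [, -1] is needed, it may help to look at what the second strsplit returns before the pieces are bound together. A small sketch (parts and tld are illustrative names):

```r
strings1 <- c("test@aol.com", "test@hotmail.com",
              "test@xyz.rr.edu", "test@abc.xx.zz.net")

# The separator ".*[.]" matches from the start of the string through the
# last dot, so each result starts with an empty string.
parts <- strsplit(strings1, ".*[.]")
print(parts[[3]])

# Stack into a two-column matrix and drop the empty first column.
tld <- do.call(rbind, parts)[, -1]
print(tld)
```

Each element of parts has the form c("", "<tld>"), so dropping the first column leaves just the TLDs.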

4) strsplit/sub This is a mix of (1) and (3):

cbind(do.call(rbind, strsplit(strings1, "@")), sub(".*[.]", "", strings1))

giving the same answer as (2).

4a) This is another way to use strsplit and sub. Here we append a @ followed by the TLD and then split on @.

do.call(rbind, strsplit(sub("(.*[.](.*))", "\\1@\\2", strings1), "@"))

giving the same answer as (2).
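
The intermediate string makes the trick in (4a) easier to follow. A sketch that prints it before splitting (marked and res are illustrative names; note that the backreferences have to be written with doubled backslashes inside an R string literal):

```r
strings1 <- c("test@aol.com", "test@hotmail.com",
              "test@xyz.rr.edu", "test@abc.xx.zz.net")

# Group 1 is the whole string, group 2 is the text after the last dot,
# so the replacement appends "@" plus the TLD to each address.
marked <- sub("(.*[.](.*))", "\\1@\\2", strings1)
print(marked)

# Splitting on "@" now yields three pieces per address.
res <- do.call(rbind, strsplit(marked, "@"))
print(res)
```

For the first address the intermediate string is "test@aol.com@com", which splits into "test", "aol.com" and "com".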

Update Added additional solutions.

Tyler Rinker · Answer

A read.table + file_ext approach (not regex but pretty easy):

dat <- read.table(text=strings1, sep="@")
dat$V3 <- tools::file_ext(strings1)
dat

##     V1            V2  V3
## 1 test       aol.com com
## 2 test   hotmail.com com
## 3 test    xyz.rr.edu edu
## 4 test abc.xx.zz.net net

Here's a purely regex approach:

do.call(rbind, strsplit(strings1, "@|\\.(?=[^\\.]+$)", perl=TRUE))

##     [,1]   [,2]        [,3] 
## [1,] "test" "aol"       "com"
## [2,] "test" "hotmail"   "com"
## [3,] "test" "xyz.rr"    "edu"
## [4,] "test" "abc.xx.zz" "net"
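
Note that the middle column here no longer contains the TLD, unlike the layout asked for in the question. If the full domain is wanted in the second column, one possible follow-up is to paste the pieces back together. A sketch, with parts and res as illustrative names:

```r
strings1 <- c("test@aol.com", "test@hotmail.com",
              "test@xyz.rr.edu", "test@abc.xx.zz.net")

# Split on "@" or on the last dot (a dot followed only by non-dot
# characters up to the end of the string).
parts <- do.call(rbind, strsplit(strings1, "@|\\.(?=[^.]+$)", perl = TRUE))

# Rebuild the full domain from the second and third pieces.
res <- cbind(parts[, 1],
             paste(parts[, 2], parts[, 3], sep = "."),
             parts[, 3])
print(res)
```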

Jay · Answer

So this is a negative lookahead regex that should give you the last .word of that line.

\.(?!.*\.)\w+
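
To use this pattern from R, two things seem to be needed: each backslash has to be doubled inside the string literal (a single one is what triggers the "unrecognized escape" error shown in the question), and the pattern has to go to a regex function rather than a fixed-string one, since stri_split_fixed treats its pattern literally. A minimal base R sketch, assuming extraction rather than splitting is acceptable (tld_with_dot and tld are illustrative names):

```r
strings1 <- c("test@aol.com", "test@hotmail.com",
              "test@xyz.rr.edu", "test@abc.xx.zz.net")

# perl = TRUE is required for the (?! ... ) negative lookahead.
hits <- regexpr("\\.(?!.*\\.)\\w+", strings1, perl = TRUE)
tld_with_dot <- regmatches(strings1, hits)

# The match includes the leading dot; strip it to get the bare TLD.
tld <- sub("^[.]", "", tld_with_dot)
print(tld)
```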
