
R regex find last occurrence of delimiter

Tags:

string

regex

r

I'm trying to get the ending for email addresses (ie .net, .com, .edu, etc.) but the portion after the @ can have multiple periods.

library(stringi)

strings1 <- c(
    'test@aol.com',
    'test@hotmail.com',
    'test@xyz.rr.edu',
    'test@abc.xx.zz.net'
)

list1 <- stri_split_fixed(strings1, "@", 2)
df1 <- data.frame(do.call(rbind,list1))

    > list2 <- stri_split_fixed(df1$X2, '.(?!.*.)', 2);list2
[[1]]
[1] "aol.com"

[[2]]
[1] "hotmail.com"

[[3]]
[1] "xyz.rr.edu"

[[4]]
[1] "abc.xx.zz.net"

Any suggestions to get something like this:

    X1            X2  X3
1 test       aol.com com
2 test   hotmail.com com
3 test    xyz.rr.edu edu
4 test abc.xx.zz.net net

EDIT: Another attempt:

> list2 <- stri_split_fixed(df1$X2, '\.(?!.*\.)\w+', 2);list2
Error: '\.' is an unrecognized escape in character string starting "'\."
screechOwl asked Oct 09 '14 23:10


3 Answers

Here are a few approaches. The first seems particularly straightforward and the second particularly short.

1) sub This can be done by applying sub in R once per column:

data.frame(X1 = sub("@.*", "", strings1), 
           X2 = sub(".*@", "", strings1), 
           X3 = sub(".*[.]", "", strings1), 
           stringsAsFactors = FALSE)

giving:

    X1            X2  X3
1 test       aol.com com
2 test   hotmail.com com
3 test    xyz.rr.edu edu
4 test abc.xx.zz.net net
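
As a rough illustration of why ".*[.]" isolates the TLD: the `.*` is greedy, so the match runs up to and including the last dot, and sub then deletes that whole prefix. The sketch below assumes one sample address taken from the question's expected output; the variable names are only for illustration.

```r
# Sketch: greedy `.*` extends the match to the LAST literal dot.
x <- "test@abc.xx.zz.net"   # sample address (assumed from the question's output)

# regexpr() locates the match; regmatches() extracts the matched text.
m <- regexpr(".*[.]", x)
prefix <- regmatches(x, m)  # everything up to and including the last dot

# Deleting that prefix leaves only the text after the last dot.
tld <- sub(".*[.]", "", x)

print(prefix)
print(tld)
```

Because the character class `[.]` is a literal dot, no backslash escaping is needed in the R string.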

2) strapplyc Here is an alternative using the gsubfn package that is particularly short. This returns a character matrix. strapplyc returns the matches to the portions of the pattern in parentheses. The first set of parentheses matches everything before @, the second set matches everything after @, and the last set matches everything after the last dot.

library(gsubfn)
pat <- "(.*)@(.*[.](.*))"
t(strapplyc(strings1, pat, simplify = TRUE))

     [,1]   [,2]            [,3] 
[1,] "test" "aol.com"       "com"
[2,] "test" "hotmail.com"   "com"
[3,] "test" "xyz.rr.edu"    "edu"
[4,] "test" "abc.xx.zz.net" "net"
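
If installing gsubfn is not an option, the same capture groups can probably be pulled out with base R's regexec and regmatches. This is a swapped-in technique, not part of the answer above; the addresses are assumed from the expected output in the question.

```r
# Sketch: base R alternative to strapplyc using the same pattern.
strings1 <- c("test@aol.com", "test@hotmail.com",
              "test@xyz.rr.edu", "test@abc.xx.zz.net")  # assumed sample data
pat <- "(.*)@(.*[.](.*))"

# regexec() records the full match plus each parenthesized group;
# regmatches() turns those positions into character vectors.
m <- regmatches(strings1, regexec(pat, strings1))

# Stack the per-string results and drop column 1 (the full match),
# leaving one column per capture group.
res <- do.call(rbind, m)[, -1]
print(res)
```

The result is a character matrix with one row per address and columns for the local part, the domain and the TLD.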

2a) read.pattern read.pattern also in the gsubfn package can do it using the same pat defined in (2):

library(gsubfn)
pat <- "(.*)@(.*[.](.*))"
read.pattern(text = strings1, pat, as.is = TRUE)

giving a data.frame similar to (1) except the column names are V1, V2 and V3.

3) strsplit The overlapping extractions make it difficult to do with strsplit but we can do it with two applications of strsplit. The first strsplit splits at the @ and the second uses everything up to the last dot to split on. This last strsplit always produces an empty string as the first split string and we delete this using [, -1]. This gives a character matrix:

 ss <- function(x, pat) do.call(rbind, strsplit(x, pat))
 cbind( ss(strings1, "@"), ss(strings1, ".*[.]")[, -1] )

giving the same answer as (2).

4) strsplit/sub This is a mix of (1) and (3):

cbind(do.call(rbind, strsplit(strings1, "@")), sub(".*[.]", "", strings1))

giving the same answer as (2).

4a) This is another way to use strsplit and sub. Here we append a @ followed by the TLD and then split on @.

do.call(rbind, strsplit(sub("(.*[.](.*))", "\\1@\\2", strings1), "@"))

giving the same answer as (2).

Update Added additional solutions.

G. Grothendieck answered Nov 13 '22 20:11


A read.table + file_ext approach (not regex but pretty easy):

dat <- read.table(text=strings1, sep="@")
dat$V3 <- tools::file_ext(strings1)
dat

##     V1            V2  V3
## 1 test       aol.com com
## 2 test   hotmail.com com
## 3 test    xyz.rr.edu edu
## 4 test abc.xx.zz.net net

Here's a purely regex approach:

do.call(rbind, strsplit(strings1, "@|\\.(?=[^\\.]+$)", perl=TRUE))

##      [,1]   [,2]        [,3]
## [1,] "test" "aol"       "com"
## [2,] "test" "hotmail"   "com"
## [3,] "test" "xyz.rr"    "edu"
## [4,] "test" "abc.xx.zz" "net"
Tyler Rinker answered Nov 13 '22 21:11


This is a negative lookahead regex that should match the last .word of each line.

\.(?!.*\.)\w+       
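
A minimal sketch of how this pattern might be used from R, assuming base R's regexpr with perl = TRUE (lookaheads need the PCRE engine). Inside an R string literal each backslash has to be doubled, which is what the "unrecognized escape" error in the question points to. The addresses are assumed from the question's expected output.

```r
# Sketch: apply the negative-lookahead pattern in base R.
strings1 <- c("test@aol.com", "test@hotmail.com",
              "test@xyz.rr.edu", "test@abc.xx.zz.net")  # assumed sample data

# Backslashes are doubled in the R string; perl = TRUE enables (?! ... ).
pat <- "\\.(?!.*\\.)\\w+"

# The match is the last dot plus the word characters that follow it.
ext <- regmatches(strings1, regexpr(pat, strings1, perl = TRUE))

# Strip the leading dot to keep just the TLD.
tld <- sub("^[.]", "", ext)
print(tld)
```

Note that regmatches drops elements with no match, so for input that may contain strings without a dot the result can be shorter than the input.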
Jay answered Nov 13 '22 20:11