Extract string elements that possibly appear multiple times, or not at all

Question

Start with a character vector of URLs. The goal is to end up with only the name of the company, meaning a column with only "test", "example" and "sample" in the example below.

urls <- c("http://grand.test.com/", "https://example.com/", 
          "http://.big.time.sample.com/")

Remove the ".com" and whatever might follow it and keep the first part:

urls <- sapply(strsplit(urls, split="(?<=.)(?=\\.com)", perl=TRUE), "[", 1)

urls
# [1] "http://grand.test"    "https://example"      "http://.big.time.sample"

My next step is to remove the http:// and https:// portions with a chained gsub() call:

urls <- gsub("^http://", "",  gsub("^https://", "", urls))

urls
# [1] "grand.test"       "example"          ".big.time.sample"

But here is where I need help. How do I handle the multiple periods (dots) before the company name in the first and third strings of urls? For example, the call below returns NA for the second string, since the "example" string has no period remaining. Or if I retain only the first part, I lose a company name.

sapply(strsplit(urls, split = "\\."), "[", 2)
# [1] "test" NA     "big"

sapply(strsplit(urls, split = "\\."), "[", 1)
# [1] "grand"   "example" ""

Perhaps an ifelse() call that counts the number of periods remaining and only uses strsplit() if there is more than one period? Also note that there may be two or more periods before the company name. I don't know how to do lookarounds, which might solve my problem. But this didn't work:

strsplit(urls, split="(?=\\.)", perl=TRUE)

Thank you for any suggestions.

Josh O'Brien · Accepted Answer

Here's an approach that may be easier to understand and to generalize than some of the others:

pat = "(.*?)(\\w+)(\\.com.*)"
gsub(pat, "\\2", urls)

It works by breaking each string up into three capture groups that together match the entire string, and substituting back in just capture group (2), the one that you want.

pat = "(.*?)(\\w+)(\\.com.*)"
#        ^     ^       ^
#        |     |       |
#       (1)   (2)     (3)
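
To see what each group actually captures, here is a small sketch that is not part of the original answer. It assumes the original urls vector from the question and passes perl = TRUE (my assumption, so that the lazy quantifier follows PCRE semantics); the matrix name m and the column labels are made up for illustration.

```r
urls <- c("http://grand.test.com/", "https://example.com/",
          "http://.big.time.sample.com/")
pat <- "(.*?)(\\w+)(\\.com.*)"

# regexec() reports the full match plus every capture group;
# regmatches() then extracts the corresponding substrings.
groups <- regmatches(urls, regexec(pat, urls, perl = TRUE))

# Stack the results into a matrix and drop the full match (column 1).
m <- do.call(rbind, groups)[, -1]
colnames(m) <- c("group1", "group2", "group3")  # illustrative labels
m
```

The group2 column should hold the company names, while group1 holds everything up to and including the last dot before the name.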

Edit (adding explanation of ? modifier):

Do note that capture group (1) needs to include the "ungreedy" or "minimal" quantifier ? (also sometimes called "lazy" or "reluctant"). It essentially tells the regex engine to match as few characters as it can, stopping as soon as the rest of the pattern can match, so that it does not use up any characters that could otherwise become a part of the following capture group (2).

Without a trailing ?, repetition quantifiers are by default greedy; in this case, a greedy capture group, (.*), since it matches any number of any type of characters, would "eat up" as much of the string as it could, leaving only a single character (the one just before .com) for capture group (2) -- not a behavior we want!
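
A quick way to check the difference is to run both variants side by side. This is only a sketch: it assumes the original urls vector from the question, and it uses sub() with perl = TRUE (my choice, not the answer's) so the behaviour is that of PCRE. The names lazy and greedy are just for illustration.

```r
urls <- c("http://grand.test.com/", "https://example.com/",
          "http://.big.time.sample.com/")

# Lazy group (1): stops as early as possible, so group (2) gets the whole word.
lazy <- sub("(.*?)(\\w+)(\\.com.*)", "\\2", urls, perl = TRUE)
lazy
# [1] "test"    "example" "sample"

# Greedy group (1): grabs as much as possible, so group (2) is left
# with only the one word character it needs to match at all.
greedy <- sub("(.*)(\\w+)(\\.com.*)", "\\2", urls, perl = TRUE)
greedy
# [1] "t" "e" "e"
```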

agstudy · Answer

I think there should be a simpler way, but this works:

sub('.*[.]','',sub('https?:[/]+[.]?(.*)[.]com[/]','\\1',urls))
# [1] "test"    "example" "sample"

Here urls is your original vector of URLs.

user20650 · Answer

I think there will be a way to extract just the word before ".com", but maybe this gives an idea:

sub("\\.com", "", regmatches(urls, gregexpr("(\\w+)\\.com", urls)))
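
One possible way to extract the word before ".com" directly is a lookahead, which also touches on the lookarounds the question asks about. This is a sketch rather than the answerer's method; it assumes the original urls vector, that every URL contains ".com", and the name company is made up here.

```r
urls <- c("http://grand.test.com/", "https://example.com/",
          "http://.big.time.sample.com/")

# \\w+ matches a run of word characters; the lookahead (?=\\.com) requires
# that ".com" follows but does not include it in the match.
# Lookarounds need perl = TRUE.
company <- regmatches(urls, regexpr("\\w+(?=\\.com)", urls, perl = TRUE))
company
# [1] "test"    "example" "sample"
```

Note that regmatches() silently drops elements with no match, so the result can be shorter than urls if some URL lacks ".com".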

thelatemail · Answer

Using strsplit might be worth a try too:

sapply(strsplit(urls,"/|\\."),function(x) tail(x,2)[1])
#[1] "test"    "example" "sample"
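
The tail(x,2)[1] step relies on "com" being the last piece after splitting. If a URL carries a path, that no longer holds. The sketch below shows one hedged alternative that takes the piece just before the first "com" instead; the fourth URL is hypothetical and added only for illustration, and the names parts and company are made up.

```r
urls <- c("http://grand.test.com/", "https://example.com/",
          "http://.big.time.sample.com/",
          "http://example.com/page")  # hypothetical URL with a path

parts <- strsplit(urls, "/|\\.")

# Original idea: second-to-last piece. Breaks on the URL with a path.
sapply(parts, function(x) tail(x, 2)[1])
# [1] "test"    "example" "sample"  "com"

# Alternative: the piece immediately before the first "com".
# match() returns NA when there is no "com", which yields NA for that URL.
company <- vapply(parts, function(x) x[match("com", x) - 1], character(1))
company
# [1] "test"    "example" "sample"  "example"
```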

Tags: substring, r, strsplit, regex-lookarounds

lawyeR
