
Extract string elements that possibly appear multiple times, or not at all

Start with a character vector of URLs. The goal is to end up with only the name of the company, meaning a column with only "test", "example" and "sample" in the example below.

urls <- c("http://grand.test.com/", "https://example.com/", 
          "http://.big.time.sample.com/")

Remove the ".com" and whatever might follow it and keep the first part:

urls <- sapply(strsplit(urls, split = "(?<=.)(?=\\.com)", perl = TRUE), "[", 1)

urls
# [1] "http://grand.test"    "https://example"      "http://.big.time.sample"

My next step is to remove the http:// and https:// portions with a chained gsub() call:

urls <- gsub("^http://", "",  gsub("^https://", "", urls))

urls
# [1] "grand.test"       "example"          ".big.time.sample"

But here is where I need help. How do I handle the multiple periods (dots) before the company name in the first and third strings of urls? For example, the call below returns NA for the second string, since the "example" string has no period remaining. Or if I retain only the first part, I lose a company name.

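# Taking the second piece returns NA when there is no period left
sapply(strsplit(urls, split = "\\."), "[", 2)
# [1] "test" NA     "big"

# Taking the first piece loses the company name when a period comes first
sapply(strsplit(urls, split = "\\."), "[", 1)
# [1] "grand"   "example" ""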

Perhaps an ifelse() call that counts the number of periods remaining and only uses strsplit() if there is more than one period? Also note that there may be two or more periods before the company name. I don't know how to do lookarounds, which might solve my problem. But this didn't work:

strsplit(urls, split = "(?=\\.)", perl = TRUE)

Thank you for any suggestions.

lawyeR asked Jun 19 '14 21:06


4 Answers

Here's an approach that may be easier to understand and to generalize than some of the others:

pat = "(.*?)(\\w+)(\\.com.*)"
gsub(pat, "\\2", urls)

It works by breaking each string up into three capture groups that together match the entire string, and substituting back in just capture group (2), the one that you want.

pat = "(.*?)(\\w+)(\\.com.*)"
#        ^    ^       ^
#        |    |       |
#       (1)  (2)     (3)  
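
To make the three groups concrete, here is a small sketch that substitutes each group back in turn. It assumes `urls` is the original vector from the question. The `perl = TRUE` argument and the name `groups` are my additions for illustration, not part of the answer above.

```r
urls <- c("http://grand.test.com/", "https://example.com/",
          "http://.big.time.sample.com/")
pat <- "(.*?)(\\w+)(\\.com.*)"

# Substitute each capture group back in turn; column i of the resulting
# matrix holds what group (i) matched for each URL.
groups <- sapply(1:3, function(i) sub(pat, paste0("\\", i), urls, perl = TRUE))

groups[, 1]  # "http://grand."  "https://"  "http://.big.time."
groups[, 2]  # "test"  "example"  "sample"
groups[, 3]  # ".com/"  ".com/"  ".com/"
```

Pasting the three columns back together row by row reproduces the original strings, which is what "together match the entire string" means here.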

Edit (adding explanation of ? modifier):

Do note that capture group (1) needs to include the "ungreedy" or "minimal" quantifier ? (also sometimes called "lazy" or "reluctant"). It tells the regex engine to match as few characters as it can while still letting the rest of the pattern match, so it does not use up any characters that could otherwise become part of the following capture group (2).

Without a trailing ?, repetition quantifiers are greedy by default. In this case a greedy capture group, (.*), which matches any number of characters of any type, would "eat up" as much of the string as it could and give back only what the rest of the pattern strictly needs. That leaves just a single character for capture group (2), which is not the behavior we want!
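
A quick way to see the difference is to run both versions side by side. This is only a sketch: `urls` is assumed to be the original vector from the question, `lazy` and `greedy` are illustrative names, and `perl = TRUE` is my addition. The greedy pattern still matches, but group (2) ends up holding only the last word character before ".com".

```r
urls <- c("http://grand.test.com/", "https://example.com/",
          "http://.big.time.sample.com/")

# Lazy group (1): stops as early as possible, so group (2) gets the whole word
lazy <- sub("(.*?)(\\w+)(\\.com.*)", "\\2", urls, perl = TRUE)

# Greedy group (1): grabs as much as possible, so group (2) gets one character
greedy <- sub("(.*)(\\w+)(\\.com.*)", "\\2", urls, perl = TRUE)

lazy    # "test" "example" "sample"
greedy  # "t" "e" "e"
```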

Josh O'Brien answered Nov 11 '22 14:11


I think there should be a simpler way, but this works:

sub('.*[.]', '', sub('https?:[/]+[.]?(.*)[.]com[/]', '\\1', urls))
# [1] "test"    "example" "sample"

Here, "urls" is your original vector of URLs.
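
To see what each sub() contributes, the nested call can be unrolled into two steps. This is a sketch of the same two patterns, with `step1` and `step2` as illustrative names and `urls` assumed to be the original vector from the question.

```r
urls <- c("http://grand.test.com/", "https://example.com/",
          "http://.big.time.sample.com/")

# Step 1: drop the scheme, an optional leading dot, and the trailing ".com/",
# keeping only what the capture group matched.
step1 <- sub("https?:[/]+[.]?(.*)[.]com[/]", "\\1", urls)
step1  # "grand.test" "example" "big.time.sample"

# Step 2: delete everything up to and including the last remaining dot.
# Strings without a dot (like "example") are returned unchanged.
step2 <- sub(".*[.]", "", step1)
step2  # "test" "example" "sample"
```

Because sub() returns its input untouched when the pattern does not match, the second step copes with zero, one, or several leading subdomains.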

agstudy answered Nov 11 '22 13:11


There is probably a way to extract just the word before ".com" directly, but maybe this gives an idea:

sub("\\.com", "", regmatches(urls, regexpr("\\w+\\.com", urls)))
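
The word before ".com" can also be pulled out directly with a lookahead, which avoids the second sub() call. The following is a minimal sketch, assuming `urls` is the original vector from the question; `m` and `company` are illustrative names.

```r
urls <- c("http://grand.test.com/", "https://example.com/",
          "http://.big.time.sample.com/")

# \\w+ matches a run of word characters; (?=\\.com) requires ".com" to follow
# without making it part of the match. Lookaheads need perl = TRUE.
m <- regexpr("\\w+(?=\\.com)", urls, perl = TRUE)

# regmatches() returns the matched text for each element that matched
company <- regmatches(urls, m)
company  # "test" "example" "sample"
```

Note that regmatches() silently drops elements with no match, so the result can be shorter than `urls` if some strings contain no ".com".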

user20650 answered Nov 11 '22 14:11


Using strsplit might be worth a try too:

sapply(strsplit(urls,"/|\\."),function(x) tail(x,2)[1])
#[1] "test"    "example" "sample"
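
Looking at the split pieces shows why taking the second-to-last element works. This is a sketch under the assumption that every URL ends in ".com" with at most a trailing slash; `parts` and `company` are illustrative names, and vapply() is swapped in for sapply() only to pin the return type.

```r
urls <- c("http://grand.test.com/", "https://example.com/",
          "http://.big.time.sample.com/")

# Split on "/" or "."; strsplit() drops the empty piece after the final "/"
parts <- strsplit(urls, "/|\\.")
parts[[1]]  # "http:" "" "grand" "test" "com"

# The company name is always the piece just before the final "com"
company <- vapply(parts, function(x) tail(x, 2)[1], character(1))
company  # "test" "example" "sample"
```

A URL with a path after the domain (for example ".com/about") would put other pieces last, so this approach would need adjusting for such input.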

thelatemail answered Nov 11 '22 12:11