
Does R have any package for parsing out the parts of a URL?

Tags: url, parsing, r

I have a list of urls that I would like to parse and normalize.

I'd like to be able to split each address into parts so that I can identify "www.google.com/test/index.asp" and "google.com/somethingelse" as being from the same website.

Rob Donnelly asked Jun 24 '13 21:06


4 Answers

You can use the parse_url() function from the httr package:

 library(httr)
 parse_url("http://google.com/")

You can get more details here: http://cran.r-project.org/web/packages/httr/httr.pdf

Abdocia answered Oct 18 '22 16:10


Since parse_url() uses regular expressions anyway, we may as well reinvent the wheel and create a single regular expression replacement in order to build a sweet and fancy gsub call.

Let's see. A URL consists of a protocol, a "netloc" which may include username, password, hostname and port components, and a remainder which we happily strip away. Let's first assume there is no username, password, or port.

  • ^(?:(?:[[:alpha:]+.-]+)://)? matches the protocol header (copied from parse_url()); we strip it away if we find it
  • Also, a potential www. prefix is stripped away, but not captured: (?:www\\.)?
  • Anything up to the subsequent slash will be our fully qualified host name, which we capture: ([^/]+)
  • The rest we ignore: .*$

Now we plug together the regexes above, and the extraction of the hostname becomes:

PROTOCOL_REGEX <- "^(?:(?:[[:alpha:]+.-]+)://)?"
PREFIX_REGEX <- "(?:www\\.)?"
HOSTNAME_REGEX <- "([^/]+)"
REST_REGEX <- ".*$"
URL_REGEX <- paste0(PROTOCOL_REGEX, PREFIX_REGEX, HOSTNAME_REGEX, REST_REGEX)
domain.name <- function(urls) gsub(URL_REGEX, "\\1", urls)

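To handle a port as well, change the hostname regex so that it matches (but does not capture) the port, then rebuild URL_REGEX:

HOSTNAME_REGEX <- "([^:/]+)(?::[0-9]+)?"
URL_REGEX <- paste0(PROTOCOL_REGEX, PREFIX_REGEX, HOSTNAME_REGEX, REST_REGEX)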

And so forth and so on, until we finally arrive at an RFC-compliant regular expression for parsing URLs. However, for home use, the above should suffice:

> domain.name(c("test.server.com/test", "www.google.com/test/index.asp",
                "http://test.com/?ex"))
[1] "test.server.com" "google.com"      "test.com"       
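
The "and so forth" step can be illustrated for the username and password case. The following is only a minimal sketch, not an RFC-compliant parser: USERINFO_REGEX and domain.name.ext are hypothetical names introduced here, the hostname class is widened to stop at ? and # as well, and perl = TRUE plus tolower() are assumptions of this sketch rather than part of the answer above.

```r
# Hypothetical extension of the pieces above; USERINFO_REGEX and
# domain.name.ext are illustrative names, not part of the original answer.
PROTOCOL_REGEX <- "^(?:(?:[[:alpha:]+.-]+)://)?"
USERINFO_REGEX <- "(?:[^@/]*@)?"             # optional user[:password]@, not captured
PREFIX_REGEX   <- "(?:www\\.)?"
HOSTNAME_REGEX <- "([^:/?#]+)(?::[0-9]+)?"   # host is captured, port is matched but dropped
REST_REGEX     <- ".*$"
URL_REGEX <- paste0(PROTOCOL_REGEX, USERINFO_REGEX, PREFIX_REGEX,
                    HOSTNAME_REGEX, REST_REGEX)

# The pattern is anchored at both ends, so sub() is enough;
# tolower() makes the comparison of host names case-insensitive.
domain.name.ext <- function(urls) {
  tolower(sub(URL_REGEX, "\\1", urls, perl = TRUE))
}

hosts <- domain.name.ext(c("http://user:secret@www.Example.com:8080/path",
                           "www.google.com/test/index.asp",
                           "google.com/somethingelse"))
print(hosts)
```

With this sketch the two google.com addresses from the question reduce to the same string, so they can be compared directly.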

krlmlr answered Oct 18 '22 14:10


There's also the urltools package now, which is much faster:

urltools::url_parse(c("www.google.com/test/index.asp", 
                      "google.com/somethingelse"))

##   scheme         domain port           path parameter fragment
## 1        www.google.com      test/index.asp                   
## 2            google.com       somethingelse                   

hrbrmstr answered Oct 18 '22 14:10


I'd forgo a package and use regex for this.

EDIT reformulated after the robot attack from Dason...

x <- c("talkstats.com", "www.google.com/test/index.asp", 
    "google.com/somethingelse", "www.stackoverflow.com",
    "http://www.bing.com/search?q=google.com&go=&qs=n&form=QBLH&pq=google.com&sc=8-1??0&sp=-1&sk=")

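# Strip a leading http:// or https://, keep everything before the first "/",
# then drop a leading "www." (anchored so that it only matches at the start)
parser <- function(x) sub("^www\\.", "", sapply(strsplit(sub("^https?://", "", x), "/"), "[[", 1))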
parser(x)

lst <- lapply(unique(parser(x)), function(var) x[parser(x) %in% var])
names(lst) <- unique(parser(x))
lst

## $talkstats.com
## [1] "talkstats.com"
## 
## $google.com
## [1] "www.google.com/test/index.asp" "google.com/somethingelse"     
## 
## $stackoverflow.com
## [1] "www.stackoverflow.com"
## 
## $bing.com
## [1] "http://www.bing.com/search?q=google.com&go=&qs=n&form=QBLH&pq=google.com&sc=8-1??0&sp=-1&sk="

This may need to be extended depending on the structure of the data.
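
One way such an extension might look, as a minimal base R sketch: normalize_host() is a hypothetical helper (not part of the answer above) that assumes the URLs carry no username or password, and the sample vector is made up for illustration. It also handles other schemes, mixed case, and ports, and uses split() for the grouping.

```r
# normalize_host() is a hypothetical helper, not from the answer above.
# Assumption: no "user:password@" part appears in the URLs.
normalize_host <- function(x) {
  x <- tolower(x)
  x <- sub("^[a-z][a-z0-9+.-]*://", "", x)  # drop any scheme, e.g. http:// or https://
  x <- sub("[/?#].*$", "", x)               # drop path, query string, and fragment
  x <- sub(":[0-9]+$", "", x)               # drop a trailing port
  sub("^www\\.", "", x)                     # drop a leading www.
}

# Made-up sample data for illustration
x <- c("talkstats.com", "HTTPS://www.Google.com/test/index.asp",
       "google.com:80/somethingelse", "www.stackoverflow.com")

# split() groups the original URLs by their normalized host;
# the groups come back sorted by host name, not in order of first appearance
groups <- split(x, normalize_host(x))
print(groups)
```

Unlike the lapply() version, this calls the normalizer only once per URL, at the cost of losing the original ordering of the groups.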

Tyler Rinker answered Oct 18 '22 16:10