I'm looking for a function which can extract the domain name from a URL in R. Any function which is similar to tldextract in R? EDIT: Currently i'm using the below approach:
domain=substr(as.character("www.google.com"), 
   which(strsplit("www.google.com",'')[[1]]=='.')[1]+1, nchar("www.google.com"))
But i'm looking for a pre-defined function which can save the coding effort.
PHP's parse_url function makes it easy to extract the domain, path and other useful bits of information from a full URL.
You can also use the relatively new urltools package:
library(urltools)
URLs <- c("http://stackoverflow.com/questions/19020749/function-to-extract-domain-name-from-url-in-r",
          "http://www.talkstats.com/", "www.google.com")
suffix_extract(domain(URLs))
##                host subdomain        domain suffix
## 1 stackoverflow.com      <NA> stackoverflow    com
## 2 www.talkstats.com       www     talkstats    com
## 3    www.google.com       www        google    com
It's backed by Rcpp so it's wicked fast (significantly more so than using built- in R apply functions.
I don't know of a function in a package to do this.  I don't think there's anything in base install of R.  Use a user defined function and store it some where to source later or make your own package with it.
x1 <- "http://stackoverflow.com/questions/19020749/function-to-extract-domain-name-from-url-in-r"
x2 <- "http://www.talkstats.com/"
x3 <- "www.google.com"
domain <- function(x) strsplit(gsub("http://|https://|www\\.", "", x), "/")[[c(1, 1)]]
domain(x3)
sapply(list(x1, x2, x3), domain)
## [1] "stackoverflow.com" "talkstats.com"     "google.com"
                        I just wrote this regex which can be applied to emails and websites for extracting and matching on domain. The regex could be modified to extract different parts, and it is vectorized. I do some additional processing which is totally optional.
invalid_domains = "yahoo.com|aol.com|gmail.com|hotmail.com|comcast.net|me.com|att.net|verizon.net|live.com|sbcglobal.net|msn.com|outlook.com|ibm.com"
domain <- function(x){
  to_return <- tolower(as.character(x))
  to_return <- gsub('.*[.@/]+([^.@:/]+[.][a-z]+)(/.*$|$)','\\1',x,ignore.case=T) # extract domain. \\1 selects just the first group.
  to_return <- gsub(".ocm", ".com", to_return) # correct mispellings
  # to_return <- ifelse(grepl(invalid_domains,to_return),NA,to_return) # (optional) check for invalid domains, especially when working with emails.
  to_return <- ifelse(grepl('[.]',to_return),to_return,NA) # if there is no . this is an invalid domain, return NA
  return(to_return)
}
A vectorized option using base R only and customizable super easily in its output could be
url_regexpr <- function() {
  protocol <- "([^/]+://)*"  # could be
  sub <- "([^\\.\\?/]+\\.)*"  # could be
  domain <- "([^\\.\\?/]+)"  # must be
  dot <- "(\\.)"  # must be
  suffix <- "([^/]+)"  # must be
  folders <- "(/[^\\?]*)*"  # could be
  args <- "(\\?.*)*"  #could be
  paste0(
    "^",
    protocol, sub, domain, dot, suffix, folders, args,
    "$"
  )
}
get_domain <- function(url, include_suffix = TRUE) {
  res <- paste0("\\3", c("\\4\\5")[include_suffix])
  sub(url_regexpr(), res, url)
}
I have run the following tests on it:
library(testthat)
test_that("get_domain works", {
  expect_equal(get_domain("https://www.example.com"), "example.com")
  expect_equal(get_domain("http://www.example.com"), "example.com")
  expect_equal(get_domain("www.example.com"), "example.com")
  expect_equal(get_domain("www.example.net"), "example.net")
  expect_equal(get_domain("www.example.net/baz"), "example.net")
  expect_equal(get_domain("https://www.example.net/baz"), "example.net")
  expect_equal(get_domain("https://www.example.net/baz/tar"), "example.net")
  expect_equal(get_domain("https://foo.example.net"), "example.net")
  expect_equal(get_domain("https://www.foo.example.net"), "example.net")
})
test_that("get_domain is vectorized", {
  urls <- c("www.example.com", "www.example.net")
  expect_equal(get_domain(urls), c("example.com", "example.net"))
})
test_that("can remove suffix", {
  expect_equal(
    get_domain("https://www.example.com", include_suffix = FALSE),
    "example"
  )
})
test_that("works with file extensions", {
  expect_equal(
    get_domain("https://www.example.com/foo.php"),
    "example.com"
  )
})
test_that("works against leading slash", {
  expect_equal(
    get_domain("http://m.example.com/"),
    "example.com"
  )
})
test_that("works against args after slash", {
  expect_equal(
    get_domain("http://example.com/?"),
    "example.com"
  )
})
test_that("works against multiple dots after slash", {
  expect_equal(
    get_domain("http://example.com/foo.net.bar"),
    "example.com"
  )
})
test_that("generalized protocols", {
  expect_equal(
    get_domain("android-app://example.com/foo.net.bar"),
    "example.com"
  )
})
                        If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With