
Function to extract domain name from URL in R

Tags:

r

I'm looking for a function that can extract the domain name from a URL in R. Is there anything in R similar to tldextract? EDIT: Currently I'm using the approach below:

domain=substr(as.character("www.google.com"), 
   which(strsplit("www.google.com",'')[[1]]=='.')[1]+1, nchar("www.google.com"))

But I'm looking for a predefined function that would save the coding effort.

Prasun Velayudhan asked Sep 26 '13 06:09



4 Answers

You can also use the relatively new urltools package:

library(urltools)

URLs <- c("http://stackoverflow.com/questions/19020749/function-to-extract-domain-name-from-url-in-r",
          "http://www.talkstats.com/", "www.google.com")

suffix_extract(domain(URLs))

##                host subdomain        domain suffix
## 1 stackoverflow.com      <NA> stackoverflow    com
## 2 www.talkstats.com       www     talkstats    com
## 3    www.google.com       www        google    com

It's backed by Rcpp, so it's wicked fast (significantly more so than using built-in R apply functions).

hrbrmstr answered Oct 16 '22 10:10


I don't know of a function in a package that does this, and I don't think there's anything in the base install of R. Use a user-defined function and store it somewhere to source later, or make your own package with it.

x1 <- "http://stackoverflow.com/questions/19020749/function-to-extract-domain-name-from-url-in-r"
x2 <- "http://www.talkstats.com/"
x3 <- "www.google.com"

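# Strip an optional leading scheme and "www." (anchored, so a "www." elsewhere in the host is untouched),
# then keep the part before the first "/". Handles one URL per call.
domain <- function(x) strsplit(sub("^(https?://)?(www\\.)?", "", x), "/")[[c(1, 1)]]

domain(x3)
## [1] "google.com"
sapply(list(x1, x2, x3), domain)
## [1] "stackoverflow.com" "talkstats.com"     "google.com"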
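
The helper above indexes the first list element with [[c(1, 1)]], so it handles one URL per call and needs sapply for several. A minimal vectorized sketch of the same idea in base R follows. The name domain_vec is hypothetical, and it assumes that only an http/https scheme and a leading "www." need stripping; other subdomains and ports are left in place.

```r
# Hypothetical vectorized variant of the helper above (base R only).
domain_vec <- function(x) {
  x <- as.character(x)
  # Drop an optional http:// or https:// scheme and a leading "www."
  x <- sub("^(https?://)?(www\\.)?", "", x)
  # Keep everything before the first "/" (any path is discarded)
  sub("/.*$", "", x)
}

urls <- c(
  "http://stackoverflow.com/questions/19020749/function-to-extract-domain-name-from-url-in-r",
  "http://www.talkstats.com/",
  "www.google.com"
)
domain_vec(urls)
## [1] "stackoverflow.com" "talkstats.com"     "google.com"
```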
Tyler Rinker answered Oct 16 '22 09:10


I wrote this regex, which extracts the domain from both email addresses and website URLs so you can match on it. It is vectorized, and the regex could be modified to extract other parts. The additional processing steps are optional.

invalid_domains <- "yahoo.com|aol.com|gmail.com|hotmail.com|comcast.net|me.com|att.net|verizon.net|live.com|sbcglobal.net|msn.com|outlook.com|ibm.com"
domain <- function(x) {
  to_return <- tolower(as.character(x))
  # Extract the domain from the lowercased input; \\1 keeps only the first capture group.
  to_return <- gsub('.*[.@/]+([^.@:/]+[.][a-z]+)(/.*$|$)', '\\1', to_return, ignore.case = TRUE)
  to_return <- gsub("[.]ocm$", ".com", to_return) # correct misspellings (literal ".ocm" at the end only)
  # to_return <- ifelse(grepl(invalid_domains, to_return), NA, to_return) # (optional) check for invalid domains, especially when working with emails.
  to_return <- ifelse(grepl('[.]', to_return), to_return, NA) # if there is no "." this is an invalid domain, return NA
  return(to_return)
}
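
One caveat about the optional filter in the function above: grepl matches the pattern anywhere in the string, so "me.com" would also flag "acme.com". Below is a sketch of an exact-match alternative, assuming the domains have already been extracted and lowercased. The names free_mail_domains and drop_listed_domains are hypothetical, and the list is shortened for illustration.

```r
# Hypothetical list of domains to blank out (shortened for illustration).
free_mail_domains <- c("yahoo.com", "aol.com", "gmail.com", "hotmail.com", "me.com")

drop_listed_domains <- function(domains, listed = free_mail_domains) {
  # %in% compares whole strings, so "acme.com" is not confused with "me.com"
  ifelse(domains %in% listed, NA_character_, domains)
}

# NA for "gmail.com"; the other two are returned unchanged
drop_listed_domains(c("gmail.com", "acme.com", "example.org"))
```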
Bryce Chamberlain answered Oct 16 '22 08:10


A vectorized option that uses only base R, with easily customizable output, could be:

url_regexpr <- function() {
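  protocol <- "([^/]+://)*"  # optional: scheme such as http://
  sub <- "([^\\.\\?/]+\\.)*"  # optional: one or more subdomain labels
  domain <- "([^\\.\\?/]+)"  # required: the domain label
  dot <- "(\\.)"  # required: the dot before the suffix
  suffix <- "([^/]+)"  # required: everything up to the first slash
  folders <- "(/[^\\?]*)*"  # optional: path
  args <- "(\\?.*)*"  # optional: query string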

  paste0(
    "^",
    protocol, sub, domain, dot, suffix, folders, args,
    "$"
  )
}

get_domain <- function(url, include_suffix = TRUE) {
  res <- paste0("\\3", c("\\4\\5")[include_suffix])
  sub(url_regexpr(), res, url)
}
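
The replacement string in get_domain relies on two base R behaviors: indexing with FALSE yields a zero-length vector, and paste0 treats a zero-length argument as an empty string. Here is a minimal sketch of that trick on a simpler three-group pattern. Note that pick_parts is a hypothetical helper, not part of the code above, and it only handles bare hosts such as example.com.

```r
pick_parts <- function(host, include_suffix = TRUE) {
  # Groups: \\1 = name, \\2 = the dot, \\3 = everything after the first dot
  pattern <- "^([^.]+)(\\.)(.+)$"
  # c("\\2\\3")[FALSE] is character(0), which paste0 treats as ""
  replacement <- paste0("\\1", c("\\2\\3")[include_suffix])
  sub(pattern, replacement, host)
}

pick_parts("example.com")                          # "example.com"
pick_parts("example.com", include_suffix = FALSE)  # "example"
```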

I have run the following tests on it:

library(testthat)
test_that("get_domain works", {
  expect_equal(get_domain("https://www.example.com"), "example.com")
  expect_equal(get_domain("http://www.example.com"), "example.com")
  expect_equal(get_domain("www.example.com"), "example.com")

  expect_equal(get_domain("www.example.net"), "example.net")
  expect_equal(get_domain("www.example.net/baz"), "example.net")

  expect_equal(get_domain("https://www.example.net/baz"), "example.net")
  expect_equal(get_domain("https://www.example.net/baz/tar"), "example.net")

  expect_equal(get_domain("https://foo.example.net"), "example.net")
  expect_equal(get_domain("https://www.foo.example.net"), "example.net")
})

test_that("get_domain is vectorized", {
  urls <- c("www.example.com", "www.example.net")
  expect_equal(get_domain(urls), c("example.com", "example.net"))
})


test_that("can remove suffix", {
  expect_equal(
    get_domain("https://www.example.com", include_suffix = FALSE),
    "example"
  )
})

test_that("works with file extensions", {
  expect_equal(
    get_domain("https://www.example.com/foo.php"),
    "example.com"
  )
})

test_that("works against trailing slash", {
  expect_equal(
    get_domain("http://m.example.com/"),
    "example.com"
  )
})

test_that("works against args after slash", {
  expect_equal(
    get_domain("http://example.com/?"),
    "example.com"
  )
})

test_that("works against multiple dots after slash", {
  expect_equal(
    get_domain("http://example.com/foo.net.bar"),
    "example.com"
  )
})

test_that("generalized protocols", {
  expect_equal(
    get_domain("android-app://example.com/foo.net.bar"),
    "example.com"
  )
})
Corrado answered Oct 16 '22 10:10