I have this df:
dput(df)
structure(list(URLs = c("http://bursesvp.ro//portal/user/_/Banco_Votorantim_Cartoes/0-7f2f5cb67f1-22918b.html",
"http://46.165.216.78/.CartoesVotorantim/Usuarios/Cadastro/BV6102891782/",
"http://www.chalcedonyhotel.com/images/promoc/premiado.tam.fidelidade/",
"http://bmbt.ro/portal/a3/_Votorantim_/VotorantimCartoes2016/0-7f2f5cb67f1-22928b.html",
"http://voeazul.nl/azul/")), .Names = "URLs", row.names = c(NA,
-5L), class = "data.frame")
It contains different URLs, and I am trying to count the number of characters in the host name, whether that is an actual name (http://hostname.com/...) or an IP (http://000.000.000.000/...). If it is an actual name, I only want the number of characters between www. and .com, that is, without the www. prefix and without the top-level domain. If it is an IP, I want all of its digits plus the dots in between.
Expected Outcome for the above sample data:
exp_outcome
1 8
2 13
3 15
4 4
5 7
I tried to do something with strsplit but could not get anywhere.
Another, maybe more direct way with a different regex:
nchar(sub("^http://(www\\.)?(([a-z]+)|([0-9.]+))(\\.[a-z]+)?/+.+$", "\\2", df$URLs))
#[1] 8 13 15 4 7
explanation:
^http:// : looks for "http://" at the beginning of the string
(www\\.)? : looks for "www.", zero or one time (so this is optional)
(([a-z]+)|([0-9.]+)) : the pattern that will be captured: either one or more lowercase letters, or digits and dots
(\\.[a-z]+)? : looks for "." followed by one or more lowercase letters, zero or one time (so again optional)
/+.+$ : looks for one or more "/" followed by anything, one or more times, until the end of the string
NB: the sub() call on its own returns the captured host names:
sub("^http://(www\\.)?(([a-z]+)|([0-9.]+))(\\.[a-z]+)?/+.+$", "\\2", df$URLs)
# [1] "bursesvp" "46.165.216.78" "chalcedonyhotel" "bmbt" "voeazul"
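One caveat that may be worth checking: sub() returns its input unchanged when the pattern does not match, so a URL that falls outside the pattern (for instance an https:// one) would have its whole length counted by nchar(). The sketch below guards against that with grepl(). Only the pattern comes from the answer above; the second URL and the variable names are made up for illustration.

```r
# Pattern from the answer above
pattern <- "^http://(www\\.)?(([a-z]+)|([0-9.]+))(\\.[a-z]+)?/+.+$"

# First URL is from the question's data; the second is a hypothetical
# https URL that the pattern does not cover
urls <- c("http://bmbt.ro/portal/a3/", "https://example.org/path/")

# TRUE where the whole URL fits the pattern
matched <- grepl(pattern, urls)

# Keep the captured host only for matching URLs, NA otherwise
host <- ifelse(matched, sub(pattern, "\\2", urls), NA_character_)

# Length of the host, with an explicit NA for URLs that did not match
host_length <- ifelse(matched, nchar(host), NA_integer_)
```

Whether unmatched URLs should become NA or be handled by a broader pattern depends on the data; this only makes the failure visible instead of silently counting the full URL.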
Here's how to do it (assuming your data.frame is called df and the URLs are in its URLs column):
domains = sub('^(http://)([^/]+)(.*)$', '\\2', df$URLs)
# This will fail for IP addresses …
hostname = sub('^(www\\.)?([^.]+)(\\..+)?$', '\\2', domains)
# … which we treat separately here:
is_ip = grepl('^(\\d{1,3}\\.){3}\\d{1,3}$', domains)
hostname[is_ip] = domains[is_ip]
df$domain_length = nchar(hostname)
On a side note, I converted your original data.frame to character strings; it simply makes no sense to use a factor for URLs.
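To illustrate that side note: nchar() raises an error when given a factor, so a column that was read in as a factor has to be converted with as.character() first. A minimal sketch, where urls_factor is a made-up vector standing in for such a column:

```r
# Hypothetical factor column holding two URLs from the question
urls_factor <- factor(c("http://voeazul.nl/azul/", "http://bmbt.ro/portal/a3/"))

# nchar() refuses factors; try() captures the error instead of stopping
factor_attempt <- try(nchar(urls_factor), silent = TRUE)
factor_fails <- inherits(factor_attempt, "try-error")

# Convert to character first, then count
urls_chr <- as.character(urls_factor)
url_lengths <- nchar(urls_chr)
```

Passing stringsAsFactors = FALSE when building or reading the data.frame avoids the conversion step altogether.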
After 5 months of dealing with URLs in general, I found the following packages, which make life a bit easier (the regexes provided in the other answers do work great, by the way):
library(urltools)
library(iptools)
df$Hostname <- domain(df$URLs)
# However, TLDs and 'www' need to go, so I used suffix_extract()$domain from `urltools`
df$Hostname <- ifelse(is.na(suffix_extract(df$Hostname)$domain), df$Hostname,
suffix_extract(df$Hostname)$domain)
#which gives:
# URLs Hostname
#1 http://bursesvp.ro//portal/user/_/... bursesvp
#2 http://46.165.216.78/.CartoesVotorantim/Usuarios/... 46.165.216.78
#3 http://www.chalcedonyhotel.com/images/promoc/ chalcedonyhotel
#4 http://bmbt.ro/portal/a3/_Votorantim_/... bmbt
#5 http://voeazul.nl/azul/ voeazul
#then simply,
nchar(df$Hostname)
#[1] 8 13 15 4 7