The function url_parse is very fast and works fine most of the time. However, domain names may now contain non-ASCII (UTF-8) characters, for example:
url <- "www.cordes-tiefkühlprodukte.de"
Now if I apply url_parse to this URL, I get a special character "<fc>" in the domain column:
url_parse(url)
scheme domain port path parameter fragment
1 <NA> www.cordes-tiefk<fc>hlprodukte.de <NA> <NA> <NA> <NA>
My question is: How can I "fix" this entry to UTF-8? I tried iconv and some functions from the stringi package, but with no success.
(I am aware of httr::parse_url, which does not have this problem. So one approach would be to detect the URLs that are not ASCII, then use url_parse on the ASCII ones and parse_url on the few special cases. However, this raises the problem of detecting the non-ASCII URLs efficiently.)
EDIT: Unfortunately, url1 <- URLencode(enc2utf8(url)) does not help. When I do
robotstxt::paths_allowed(
url1,
domain=urltools::suffix_extract(urltools::domain(url1))
)
I get the error "could not resolve host". However, when I plug in the original URL and the second-level domain by hand, paths_allowed works.
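A possible explanation for the failed lookup, sketched in base R: URLencode percent-encodes the non-ASCII bytes, and a percent sign is not a valid character in a DNS host name, so the encoded string can no longer be resolved. Internationalized domain names are normally converted to Punycode instead. The variable names below are illustrative only.

```r
# "\u00fc" is the escape for "ü", so this file does not depend on its own encoding.
url <- "www.cordes-tiefk\u00fchlprodukte.de"

# Percent-encoding is meant for paths and queries, not for host names.
encoded <- utils::URLencode(enc2utf8(url))

# A DNS host name may only contain letters, digits, hyphens and dots.
looks_like_hostname <- grepl("^[A-Za-z0-9.-]+$", encoded)
has_percent <- grepl("%", encoded, fixed = TRUE)

print(encoded)
print(looks_like_hostname)
```

If the installed urltools version provides puny_encode (an assumption worth checking against its documentation), applying that to the domain instead of URLencode may be closer to what the resolver expects.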
> sessionInfo()
R version 3.6.1 (2019-07-05) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows 10 x64 (build 17134)
Matrix products: default
locale:
[1] LC_COLLATE=German_Germany.1252 LC_CTYPE=German_Germany.1252
[3] LC_MONETARY=German_Germany.1252 LC_NUMERIC=C
[5] LC_TIME=German_Germany.1252
attached base packages: [1] stats graphics grDevices utils datasets methods base
other attached packages: [1] urltools_1.7.3 fortunes_1.5-4
loaded via a namespace (and not attached): [1] compiler_3.6.1 Rcpp_1.0.1 triebeard_0.3.0
I could reproduce the issue. I could convert the column domain to UTF-8 by reading it with readr::parse_character and latin1 encoding:
library(urltools)
library(tidyverse)
url <- "www.cordes-tiefkühlprodukte.de"
parts <-
url_parse(url) %>%
mutate(domain = parse_character(domain, locale = locale(encoding = "latin1")))
parts
scheme domain port path parameter fragment
1 <NA> www.cordes-tiefkühlprodukte.de <NA> <NA> <NA> <NA>
I guess that the encoding you have to specify (here latin1) depends only on your locale and not on the URL's special characters, but I'm not 100% sure about that.
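The same idea can be sketched in base R without readr, assuming the domain really comes back as latin1 (Windows-1252) bytes with no encoding mark, as the <fc> in the output suggests. The string below is built by hand to imitate that output. It is not produced by url_parse here.

```r
# Imitate the returned domain: a single 0xFC byte where "ü" should be.
domain <- "www.cordes-tiefk\xfchlprodukte.de"

# Declare the bytes to be latin1 and convert them to UTF-8.
fixed <- iconv(domain, from = "latin1", to = "UTF-8")

Encoding(fixed)    # encoding mark of the converted string
charToRaw(fixed)   # the single 0xFC byte has become the pair C3 BC
```

If the source bytes are in a different encoding on another machine, the from argument would need to change accordingly.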
Just for reference, another method that works fine for me is:
library(stringi)
url <- "www.cordes-tiefkühlprodukte.de"
url <- stri_escape_unicode(url)
dat <- urltools::url_parse(url)
for(cn in colnames(dat)) dat[,cn] <- stri_unescape_unicode(dat[,cn])