
Using urltools::url_parse with UTF-8 domains

Tags: r, url-parsing

The function urltools::url_parse is very fast and works fine most of the time. However, domain names can now contain non-ASCII characters, for example

url <- "www.cordes-tiefkühlprodukte.de"

Now if I apply url_parse to this URL, I get the escaped byte "<fc>" in place of the "ü" in the domain column:

url_parse(url)
  scheme                            domain port path parameter fragment
1   <NA> www.cordes-tiefk<fc>hlprodukte.de <NA> <NA>      <NA>     <NA>

My question is: How can I "fix" this entry to UTF-8? I tried iconv and some functions from the stringi package, but with no success.

(I am aware of httr::parse_url, which does not have this problem. So one approach would be to detect the URLs that are not ASCII, use url_parse on the ASCII ones, and fall back to parse_url for the few special cases. However, this raises the problem of detecting the non-ASCII URLs efficiently.)
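The detection step described above could be sketched in base R as follows. This is only an illustration: `is_ascii` is a hypothetical helper name, and the example URLs are made up apart from the one in the question.

```r
# Hypothetical helper: TRUE where every byte of the string is below 0x80,
# i.e. the string is pure ASCII regardless of its declared encoding.
is_ascii <- function(x) {
  vapply(
    x,
    function(s) all(as.integer(charToRaw(s)) < 128L),
    logical(1),
    USE.NAMES = FALSE
  )
}

# "\u00fc" is "ü"; the escape keeps this snippet independent of file encoding.
urls <- c("www.example.com", "www.cordes-tiefk\u00fchlprodukte.de")
ascii <- is_ascii(urls)

ascii_urls     <- urls[ascii]   # candidates for the fast url_parse path
non_ascii_urls <- urls[!ascii]  # candidates for the slower fallback
```

For long vectors, stringi::stri_enc_isascii performs the same byte-level check in a vectorised way and is likely to be faster than the vapply loop.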

EDIT: Unfortunately, url1 <- URLencode(enc2utf8(url)) does not help. When I do

robotstxt::paths_allowed(
    url1, 
    domain=urltools::suffix_extract(urltools::domain(url1))
)

I get the error "could not resolve host". However, when I plug in the original URL and the second-level domain by hand, paths_allowed works.

> sessionInfo()

R version 3.6.1 (2019-07-05)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 17134)

Matrix products: default

locale:
[1] LC_COLLATE=German_Germany.1252  LC_CTYPE=German_Germany.1252
[3] LC_MONETARY=German_Germany.1252 LC_NUMERIC=C
[5] LC_TIME=German_Germany.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] urltools_1.7.3 fortunes_1.5-4

loaded via a namespace (and not attached):
[1] compiler_3.6.1  Rcpp_1.0.1      triebeard_0.3.0

Karsten W. asked Jul 16 '19 15:07


2 Answers

I could reproduce the issue. I was able to convert the domain column to UTF-8 by re-reading it with readr::parse_character and declaring the input encoding as latin1:

library(urltools)
library(tidyverse)

url <- "www.cordes-tiefkühlprodukte.de"

parts <- 
  url_parse(url) %>% 
  mutate(domain = parse_character(domain, locale = locale(encoding = "latin1")))

parts

  scheme                         domain port path parameter fragment
1   <NA> www.cordes-tiefkühlprodukte.de <NA> <NA>      <NA>     <NA>

I guess that the encoding you have to specify (here latin1) depends only on your locale and not on the url's special characters, but I'm not 100% sure about that.
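The same re-declaration can be sketched in base R without readr. The snippet below does not call url_parse; it simulates the situation with a hand-built string of Latin-1 bytes, and it assumes the bytes really are Latin-1 (as the German_Germany.1252 locale in the question suggests). Whether the same holds under other locales is untested here.

```r
# Simulate a Latin-1 encoded fragment with no encoding mark:
# 0xfc is "ü" in Latin-1 but is not valid UTF-8 on its own.
raw_domain <- rawToChar(as.raw(c(0x6b, 0xfc, 0x68, 0x6c)))  # bytes of "kühl"

# Assumption: the input bytes are Latin-1. iconv reinterprets them
# and returns a string that is marked as UTF-8.
fixed <- iconv(raw_domain, from = "latin1", to = "UTF-8")

Encoding(fixed)   # inspect the declared encoding of the result
charToRaw(fixed)  # inspect the bytes: "ü" is now a two-byte sequence
```

Applied to the parsed data frame, the equivalent step would be iconv(parts$domain, from = "latin1", to = "UTF-8"), with the `from` value adjusted to whatever encoding the session actually uses.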

AEF answered Nov 15 '22 06:11


Just for reference, another method that works fine for me is:

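library(stringi)
url <- "www.cordes-tiefkühlprodukte.de"
# Escape non-ASCII characters (ü becomes \u00fc) so the parser only sees ASCII
url <- stri_escape_unicode(url)
# urltools exports url_parse (parse_url is the httr function)
dat <- urltools::url_parse(url)
# Restore the original characters in every column
for (cn in colnames(dat)) dat[, cn] <- stri_unescape_unicode(dat[, cn])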
Karsten W. answered Nov 15 '22 05:11