URLdecode with characters encoded by more than one %

Tags: r, urldecode

This should be an easy one.

Let's suppose I have this string in R:

a <- "%C3%B6sterlich"

What this means is:

österlich (which means 'Easter-related' in German)

However, if I do URLdecode(a), I get:

[1] "Ã¶sterlich"

This makes sense in a way, because %C3 is Ã and %B6 is ¶ when each byte is read as a single-byte (Latin-1) character. But as you can see here: http://www.backbone.se/urlencodingUTF8.htm , the two bytes %C3%B6 together mean ö in UTF-8 encoding.
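
To make the byte-level picture concrete, here is a minimal sketch in base R. It only inspects the raw bytes that URLdecode() produces and counts characters under two declared encodings; the variable names are illustrative, not part of any API.

```r
a <- "%C3%B6sterlich"
decoded <- utils::URLdecode(a)

# The two percent escapes become two raw bytes: c3 b6.
bytes <- charToRaw(decoded)
print(bytes[1:2])

# Read as Latin-1, every byte is its own character (10 characters).
as_latin1 <- decoded
Encoding(as_latin1) <- "latin1"
print(nchar(as_latin1))

# Read as UTF-8, the first two bytes form a single character (9 characters).
as_utf8 <- decoded
Encoding(as_utf8) <- "UTF-8"
print(nchar(as_utf8))
```

The bytes are the same in both cases; only the declared encoding changes how they are grouped into characters.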

Now the question: How do I tell URLdecode() to use the UTF-8 table?

swolf asked Mar 23 '23 23:03


2 Answers

I finally found a way to solve this problem. Here's my use case and what I tried.

These strings come from scraping Wikipedia with rvest, so the data itself should be fine. All of them contain %, but not all of them cause problems.

#problem strings
problem_strs = c("Roscoe_%22Fatty%22_Arbuckle", "Michael_%22Atters%22_Attree", 
  "J%C3%BCrgen_Becker", "Vicco_von_B%C3%BClow", "B%C3%BClent_Ceylan", 
  "Se%C3%A1n_Cullen", "Chris_D%27Elia", "U%C4%9Fur_R%C4%B1fat_Karlova", 
  "Mike_Kr%C3%BCger", "Andr%C3%A9s_L%C3%B3pez_Forero", "Mo%27Nique", 
  "Jos%C3%A9_S%C3%A1nchez_Mota", "Dara_%C3%93_Briain", "Conan_O%27Brien", 
  "Mike_O%27Brien_(actor)", "Carroll_O%27Connor", "Donald_O%27Connor", 
  "Rosie_O%27Donnell", "Michael_O%27Donoghue", "Chris_O%27Dowd", 
  "Ardal_O%27Hanlon", "Catherine_O%27Hara", "Patrice_O%27Neal", 
  "Barunka_O%27Shaughnessy", "Raven-Symon%C3%A9", "Charles_%22Chic%22_Sale", 
  "No%C3%ABl_Wells", "%22Weird_Al%22_Yankovic", "Cem_Y%C4%B1lmaz"
)

First, try the base R solution. It is not vectorized here, so we use purrr:

#utils::URLdecode
library(magrittr)  # provides %>%
problem_strs %>% purrr::map_chr(utils::URLdecode)

[1] "Roscoe_\"Fatty\"_Arbuckle" "Michael_\"Atters\"_Attree" "JÃ¼rgen_Becker"            "Vicco_von_BÃ¼low"         
[5] "BÃ¼lent_Ceylan"            "SeÃ¡n_Cullen"              "Chris_D'Elia"              "UÄŸur_RÄ±fat_Karlova"     
[9] "Mike_KrÃ¼ger"              "AndrÃ©s_LÃ³pez_Forero"     "Mo'Nique"                  "JosÃ©_SÃ¡nchez_Mota"      
[13] "Dara_Ã“_Briain"            "Conan_O'Brien"             "Mike_O'Brien_(actor)"      "Carroll_O'Connor"         
[17] "Donald_O'Connor"           "Rosie_O'Donnell"           "Michael_O'Donoghue"        "Chris_O'Dowd"             
[21] "Ardal_O'Hanlon"            "Catherine_O'Hara"          "Patrice_O'Neal"            "Barunka_O'Shaughnessy"    
[25] "Raven-SymonÃ©"             "Charles_\"Chic\"_Sale"     "NoÃ«l_Wells"               "\"Weird_Al\"_Yankovic"    
[29] "Cem_YÄ±lmaz"

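If we compare these with the input strings, we can see the pattern: the strings that cause problems are those where a single character is encoded as two consecutive % escapes (for example %C3%BC). So I read all the questions related to URL decoding in R and found these suggested solutions: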

#urltools::url_decode
urltools::url_decode(problem_strs)

Same result as before.

What is the encoding? Let's try to set it to UTF-8:

> Encoding(problem_strs)
 [1] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
[13] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
[25] "unknown" "unknown" "unknown" "unknown" "unknown"
> #try to set
> Encoding(problem_strs) = "UTF-8"
> Encoding(problem_strs)
 [1] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
[13] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
[25] "unknown" "unknown" "unknown" "unknown" "unknown"
> Encoding(problem_strs) = "utf8"
> Encoding(problem_strs)
 [1] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
[13] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
[25] "unknown" "unknown" "unknown" "unknown" "unknown"
> urltools::url_decode(problem_strs)

Same output as before.

Some answers suggested another way to check and set the encoding:

> problem_strs = iconv(problem_strs, from = "ASCII", to = "UTF-8")
> Encoding(problem_strs)
 [1] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
[13] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
[25] "unknown" "unknown" "unknown" "unknown" "unknown"
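
A likely reason none of these attempts changed anything: the percent-encoded strings contain only ASCII characters, and R does not mark pure ASCII strings with a declared encoding. The sketch below illustrates this with one of the problem strings; it assumes nothing beyond base R.

```r
x <- "J%C3%BCrgen_Becker"   # still percent-encoded, so pure ASCII

# Declaring an encoding on an ASCII string leaves it unmarked.
Encoding(x) <- "UTF-8"
print(Encoding(x))

# Converting ASCII to UTF-8 changes no bytes, so the result is also unmarked.
y <- iconv(x, from = "ASCII", to = "UTF-8")
print(Encoding(y))
```

Both calls report "unknown", which suggests the encoding has to be declared after decoding, once non-ASCII bytes are actually present.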

And I found another package on the list:

> #Ruchardet to detect?
> Ruchardet::detectEncoding(problem_strs)
 [1] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""

#Is it simpler than we thought?
urltools::url_decode(problem_strs) %>% urltools::url_decode()

Same output.

So I searched for a specific pattern that causes problems, such as %C3%BC, and found a partial answer for PHP:

First you need to urldecode it, this will give you Ã¼, which is the UTF8-encoded representation of ü, so you should be all good.

OK, let's try that in R:

#url decode, then set utf
halfway = urltools::url_decode(problem_strs)
Encoding(halfway) = "UTF-8"
halfway
 [1] "Roscoe_\"Fatty\"_Arbuckle" "Michael_\"Atters\"_Attree" "Jürgen_Becker"             "Vicco_von_Bülow"          
 [5] "Bülent_Ceylan"             "Seán_Cullen"               "Chris_D'Elia"              "Uğur_Rıfat_Karlova"       
 [9] "Mike_Krüger"               "Andrés_López_Forero"       "Mo'Nique"                  "José_Sánchez_Mota"        
[13] "Dara_Ó_Briain"             "Conan_O'Brien"             "Mike_O'Brien_(actor)"      "Carroll_O'Connor"         
[17] "Donald_O'Connor"           "Rosie_O'Donnell"           "Michael_O'Donoghue"        "Chris_O'Dowd"             
[21] "Ardal_O'Hanlon"            "Catherine_O'Hara"          "Patrice_O'Neal"            "Barunka_O'Shaughnessy"    
[25] "Raven-Symoné"              "Charles_\"Chic\"_Sale"     "Noël_Wells"                "\"Weird_Al\"_Yankovic"    
[29] "Cem_Yılmaz"               

Here's a reusable function:

url_decode_utf = function(x) {
  y = urltools::url_decode(x)
  Encoding(y) = "UTF-8"
  y
}
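
If urltools is not available, the same idea can be sketched with base R only. The helper name url_decode_utf_base is hypothetical, and the element-wise vapply() is an assumption that utils::URLdecode should be given one string at a time.

```r
# Hypothetical base-R variant of the function above.
url_decode_utf_base <- function(x) {
  # Decode each string separately; the result holds raw UTF-8 bytes.
  y <- vapply(x, utils::URLdecode, character(1), USE.NAMES = FALSE)
  # Declare that those bytes are UTF-8 so R groups them into characters correctly.
  Encoding(y) <- "UTF-8"
  y
}

decoded <- url_decode_utf_base(
  c("J%C3%BCrgen_Becker", "Chris_D%27Elia", "Cem_Y%C4%B1lmaz")
)
print(decoded)
```

Strings without multi-byte sequences, such as the %27 case, pass through unchanged apart from the decoding itself.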

CoderGuy123 answered Apr 06 '23 01:04


Try this on the decoded string:

> a <- URLdecode(a)
> Encoding(a) <- "UTF-8"

Or use the iconv function:
http://stat.ethz.ch/R-manual/R-devel/library/base/html/iconv.html
http://astrostatistics.psu.edu/datasets/2006tutorial/html/utils/html/iconv.html
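
For the iconv route, a minimal sketch might look like the following. It assumes the decoded bytes really are valid UTF-8; the variable names are illustrative.

```r
a <- "%C3%B6sterlich"
decoded <- utils::URLdecode(a)

# from = "UTF-8" says how to read the bytes; to = "UTF-8" asks for UTF-8 output.
fixed <- iconv(decoded, from = "UTF-8", to = "UTF-8")

print(fixed)
print(nchar(fixed))
```

Unlike setting Encoding() directly, iconv() returns NA for input that is not valid in the stated source encoding, which can help catch strings that were not UTF-8 to begin with.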

Hope it helps ^_^

Alesanco answered Apr 06 '23 01:04