This should be an easy one.
Let's suppose I have this string in R:
a <- "%C3%B6sterlich"
What this means is:
österlich
(which means 'Easter-related' in German)
However, if I do URLdecode(a), I get:
[1] "Ã¶sterlich"
This makes sense in a way, because read as single bytes in Latin-1 (ISO-8859-1), %C3 is Ã and %B6 is ¶. But as you can see here: http://www.backbone.se/urlencodingUTF8.htm , %C3%B6 means ö in UTF-8 encoding.
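One way to check what URLdecode() really returns is to look at the raw bytes rather than the printed string. The sketch below (variable names are illustrative, not from the post) suggests that the decoder already produces the correct UTF-8 bytes for ö, and that the two-character garbage only appears when those bytes are read as Latin-1.

```r
# Decode the percent-escapes and inspect the resulting bytes.
decoded <- utils::URLdecode("%C3%B6")
decoded_bytes <- charToRaw(decoded)          # the two bytes 0xC3 0xB6

# The UTF-8 encoding of U+00F6 (o with diaeresis) is exactly those two bytes.
utf8_bytes <- charToRaw(enc2utf8("\u00f6"))

# Reading the same bytes as Latin-1 instead gives two characters:
# U+00C3 (A with tilde) followed by U+00B6 (pilcrow sign).
as_latin1 <- iconv(decoded, from = "latin1", to = "UTF-8")
latin1_codepoints <- utf8ToInt(as_latin1)    # code points 195 and 182

print(identical(decoded_bytes, utf8_bytes))
```

So the bytes are fine; what is missing is the information that they should be interpreted as UTF-8.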
Now the question: How do I tell URLdecode() to use the UTF-8 table?
I finally found the way to solve this problem. Here's my use case and what I tried.
These strings come from scraping Wikipedia with rvest, so they should be well-formed. All of them contain % escapes, but not all of them cause problems.
#problem strings
problem_strs = c("Roscoe_%22Fatty%22_Arbuckle", "Michael_%22Atters%22_Attree",
"J%C3%BCrgen_Becker", "Vicco_von_B%C3%BClow", "B%C3%BClent_Ceylan",
"Se%C3%A1n_Cullen", "Chris_D%27Elia", "U%C4%9Fur_R%C4%B1fat_Karlova",
"Mike_Kr%C3%BCger", "Andr%C3%A9s_L%C3%B3pez_Forero", "Mo%27Nique",
"Jos%C3%A9_S%C3%A1nchez_Mota", "Dara_%C3%93_Briain", "Conan_O%27Brien",
"Mike_O%27Brien_(actor)", "Carroll_O%27Connor", "Donald_O%27Connor",
"Rosie_O%27Donnell", "Michael_O%27Donoghue", "Chris_O%27Dowd",
"Ardal_O%27Hanlon", "Catherine_O%27Hara", "Patrice_O%27Neal",
"Barunka_O%27Shaughnessy", "Raven-Symon%C3%A9", "Charles_%22Chic%22_Sale",
"No%C3%ABl_Wells", "%22Weird_Al%22_Yankovic", "Cem_Y%C4%B1lmaz"
)
First, try the base R solution. It's not vectorized for some reason, so we use purrr:
#utils::URLdecode
library(magrittr) #provides the %>% pipe
problem_strs %>% purrr::map_chr(utils::URLdecode)
[1] "Roscoe_\"Fatty\"_Arbuckle" "Michael_\"Atters\"_Attree" "Jürgen_Becker" "Vicco_von_Bülow"
[5] "Bülent_Ceylan" "Seán_Cullen" "Chris_D'Elia" "Uğur_Rıfat_Karlova"
[9] "Mike_Krüger" "Andrés_López_Forero" "Mo'Nique" "José_Sánchez_Mota"
[13] "Dara_Ó_Briain" "Conan_O'Brien" "Mike_O'Brien_(actor)" "Carroll_O'Connor"
[17] "Donald_O'Connor" "Rosie_O'Donnell" "Michael_O'Donoghue" "Chris_O'Dowd"
[21] "Ardal_O'Hanlon" "Catherine_O'Hara" "Patrice_O'Neal" "Barunka_O'Shaughnessy"
[25] "Raven-Symoné" "Charles_\"Chic\"_Sale" "Noël_Wells" "\"Weird_Al\"_Yankovic"
[29] "Cem_Yılmaz"
Comparing these with the inputs shows the pattern: the strings with two consecutive % escapes (multi-byte UTF-8 characters such as %C3%BC) are the ones that cause problems. So I read all the questions about URL decoding in R and found these suggested solutions:
#urltools::url_decode
urltools::url_decode(problem_strs)
Same result as before.
What is the encoding? Let's check it and try to set it to UTF-8:
> Encoding(problem_strs)
[1] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
[13] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
[25] "unknown" "unknown" "unknown" "unknown" "unknown"
> #try to set
> Encoding(problem_strs) = "UTF-8"
> Encoding(problem_strs)
[1] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
[13] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
[25] "unknown" "unknown" "unknown" "unknown" "unknown"
> Encoding(problem_strs) = "utf8"
> Encoding(problem_strs)
[1] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
[13] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
[25] "unknown" "unknown" "unknown" "unknown" "unknown"
> urltools::url_decode(problem_strs)
Same output as before.
Some answers suggested another way to check and set the encoding:
> problem_strs = iconv(problem_strs, from = "ASCII", to = "UTF-8")
> Encoding(problem_strs)
[1] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
[13] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
[25] "unknown" "unknown" "unknown" "unknown" "unknown"
And I found another package on the list:
> #Ruchardet to detect?
> Ruchardet::detectEncoding(problem_strs)
[1] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
#Is it simpler than we thought?
urltools::url_decode(problem_strs) %>% urltools::url_decode()
Same output.
So I googled a specific pattern that causes problems, such as %C3%BC, and found a partial answer for PHP:
First you need to urldecode it, this will give you Ã¼, which is the UTF8-encoded representation of ü, so you should be all good.
OK, let's try that in R:
#url decode, then set utf
halfway = urltools::url_decode(problem_strs)
Encoding(halfway) = "UTF-8"
halfway
[1] "Roscoe_\"Fatty\"_Arbuckle" "Michael_\"Atters\"_Attree" "Jürgen_Becker" "Vicco_von_Bülow"
[5] "Bülent_Ceylan" "Seán_Cullen" "Chris_D'Elia" "Uğur_Rıfat_Karlova"
[9] "Mike_Krüger" "Andrés_López_Forero" "Mo'Nique" "José_Sánchez_Mota"
[13] "Dara_Ó_Briain" "Conan_O'Brien" "Mike_O'Brien_(actor)" "Carroll_O'Connor"
[17] "Donald_O'Connor" "Rosie_O'Donnell" "Michael_O'Donoghue" "Chris_O'Dowd"
[21] "Ardal_O'Hanlon" "Catherine_O'Hara" "Patrice_O'Neal" "Barunka_O'Shaughnessy"
[25] "Raven-Symoné" "Charles_\"Chic\"_Sale" "Noël_Wells" "\"Weird_Al\"_Yankovic"
[29] "Cem_Yılmaz"
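The earlier attempts to set Encoding() directly on problem_strs had no effect because the percent-encoded strings are pure ASCII, and R does not attach a declared encoding to ASCII strings. Only after decoding do the strings contain non-ASCII bytes that can be marked. A minimal sketch with one string from the list (variable names are mine, and base URLdecode() stands in for urltools::url_decode here):

```r
encoded <- "J%C3%BCrgen_Becker"

# Before decoding: the string is pure ASCII, so the mark is silently ignored.
Encoding(encoded) <- "UTF-8"
enc_before <- Encoding(encoded)

# After decoding: the string holds the bytes 0xC3 0xBC, which can be marked.
decoded <- utils::URLdecode(encoded)
Encoding(decoded) <- "UTF-8"
enc_after <- Encoding(decoded)

print(c(before = enc_before, after = enc_after))
```

The order of operations is the whole trick: decode first, then declare the encoding.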
Here's a reusable function:
url_decode_utf = function(x) {
  y = urltools::url_decode(x)
  Encoding(y) = "UTF-8"
  y
}
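If installing urltools is not an option, the same idea can probably be sketched with base R only. The function name url_decode_utf_base is hypothetical, and the vapply wrapper is just a precaution in case utils::URLdecode does not handle vectors in the installed R version.

```r
# Hypothetical base-R counterpart of url_decode_utf(); the name is illustrative.
url_decode_utf_base <- function(x) {
  # Decode one element at a time so this works whether or not
  # URLdecode() accepts vectors in the installed R version.
  y <- vapply(x, utils::URLdecode, character(1), USE.NAMES = FALSE)
  # Declare the decoded bytes to be UTF-8 (ASCII-only elements stay "unknown").
  Encoding(y) <- "UTF-8"
  y
}

decoded <- url_decode_utf_base(
  c("J%C3%BCrgen_Becker", "Chris_D%27Elia", "Cem_Y%C4%B1lmaz")
)
print(decoded)
```

This assumes the input really is percent-encoded UTF-8, as Wikipedia URLs are. Input encoded in another charset would be mislabeled rather than converted.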
Try this after decoding:
> a <- URLdecode(a)
> Encoding(a) <- "UTF-8"
Or use the iconv function:
http://stat.ethz.ch/R-manual/R-devel/library/base/html/iconv.html
http://astrostatistics.psu.edu/datasets/2006tutorial/html/utils/html/iconv.html
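A rough sketch of the iconv route, under the assumption that the decoded bytes are valid UTF-8: converting "from UTF-8 to UTF-8" leaves the bytes unchanged, and as far as I understand the iconv documentation, the result is then marked as UTF-8, which should have the same practical effect as setting Encoding(). The variable names decoded and fixed are illustrative.

```r
a <- "%C3%B6sterlich"
decoded <- utils::URLdecode(a)

# Tell iconv that the input bytes are UTF-8 and ask for UTF-8 back.
fixed <- iconv(decoded, from = "UTF-8", to = "UTF-8")

# Compare against the expected word, written with a Unicode escape.
print(fixed == "\u00f6sterlich")
```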
Hope it helps ^_^