
URL / URI encoding in R

I need to call an API that expects URL encoding according to RFC 3986, and my query contains accented characters.

For instance, this argument:

quel écrivain ?

should be encoded like this:

quel%20%C3%A9crivain%20%3F%0D%0A

Unfortunately, when I use URLencode, encoding, url_encode, or curlEscape, I get the following encoding:

URLencode("quel écrivain ?")
[1] "quel%20%E9crivain%20?"

The problem is with accented letters: for instance, "é" is converted into "%E9" instead of "%C3%A9".

I have been struggling with this URL encoding without finding a solution. As I have no control over the API, I don't know how it handles the encoding.

Oddly, using POST instead of GET leads to a response in which words with accents are split across two lines:

"1\tquel\tquel\tDET\tDET\tGender=Masc|Number=Sing\t5\tdet\t0\t_\n4\t<U+FFFD>\t<U+FFFD>\tSYM\tSYM\t_\t5\tcompound\t0\t_\n5\tcrivain\tcrivain\

As you can see, "écrivain" is split into "<U+FFFD>" (the Unicode replacement character, shown in place of "é") and "crivain".

This encoding problem is driving me mad; if a brilliant mind could help me, I would be very grateful!

Tau asked Dec 20 '17 11:12


2 Answers

Set reserved = TRUE in URLencode(), so that reserved characters such as "?" are percent-encoded as well,

i.e.

your_string <- "quel écrivain ?"

URLencode(your_string, reserved = TRUE)
# [1] "quel%20%C3%A9crivain%20%3F"
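
A note on why the accented letter may still come out as %E9 on some machines: URLencode() appears to percent-encode the bytes of the string as R holds them, so the result can depend on the session's encoding. A minimal sketch, assuming that is the cause, converts the string to UTF-8 with enc2utf8() before encoding. The helper name encode_query is hypothetical and only used for illustration.

```r
# Hypothetical helper (not part of base R): force UTF-8 before percent-encoding.
encode_query <- function(x) {
  # enc2utf8() converts the string to UTF-8 and marks it as such
  x_utf8 <- enc2utf8(x)
  # reserved = TRUE also escapes reserved characters such as "?"
  utils::URLencode(x_utf8, reserved = TRUE)
}

# "\u00e9" is "é"; writing it as an escape keeps the script ASCII-only
encoded <- encode_query("quel \u00e9crivain ?")
print(encoded)
# [1] "quel%20%C3%A9crivain%20%3F"
```

If the API really expects the trailing %0D%0A from the question, the line break would have to be part of the input string; that is not shown here.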
stevec answered Sep 27 '22 17:09


I do not think I am a brilliant mind, but I still have a possible solution for you. After using URLencode(), it seems that your accented characters are converted into the trailing part of their Unicode representation, preceded by a %. To convert them back into readable characters, you can turn them into proper Unicode escapes and use the stringi package to unescape them. For your single string, at least, this solution worked on my machine. I hope it also works for you.

Please note that I have introduced a % character at the end of your string to demonstrate that the gsub command below should work in any case.

You might have to adapt the replacement pattern \\u00 to also cover Unicode code points whose first two hex digits are not 00, if this is relevant in your case.

library(stringi)
str <- "quel écrivain ?"
str <- URLencode(str)
# "quel%20%E9crivain%20?"
# (single-byte escapes as in the question; a UTF-8 session may give %C3%A9 instead)
# Replacing % with a single backslash to get the Unicode escape directly
# does not work because backslash is an escape character, hence "\\"
str <- gsub("%", paste0("\\", "u00"), str, fixed = TRUE)
# [1] "quel\\u0020\\u00E9crivain\\u0020?"
# The string now holds literal \uXXXX sequences (printed with doubled
# backslashes); stri_unescape_unicode() converts them into characters
str <- stri_unescape_unicode(str)
# [1] "quel écrivain ?"
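
For comparison, base R's utils::URLdecode() reverses percent-encoding directly, without going through \u escapes. The sketch below assumes the encoded bytes are UTF-8 (as in the %C3%A9 form the question asks for); the Encoding() mark is set by hand because the decoded bytes come back without one.

```r
# Decode a percent-encoded string whose bytes are assumed to be UTF-8
encoded <- "quel%20%C3%A9crivain%20%3F"
decoded <- utils::URLdecode(encoded)
# Declare the decoded bytes as UTF-8 so R prints the accented letter correctly
Encoding(decoded) <- "UTF-8"
print(decoded)
```

This only recovers the original text; it does not change what URLencode() sends to the API.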
Manuel Bickel answered Sep 27 '22 17:09