URL / URI encoding in R

Question

I have to call an API that requires URL encoding according to RFC 3986, and my query contains accented characters.

For instance, this argument:

quel écrivain ?

should be encoded like this:

quel%20%C3%A9crivain%20%3F%0D%0A

Unfortunately, when I use URLencode, encoding, url_encode, or curlEscape, I get the following encoding:

URLencode("quel écrivain ?")
[1] "quel%20%E9crivain%20?"

The problem is with accented letters: for instance, "é" is converted into "%E9" instead of "%C3%A9".

I have been struggling with this URL encoding without finding a solution. As I have no control over the API, I don't know how it handles the encoding.

A weird thing is that using POST instead of GET leads to a response in which words with accents are split across two different lines:

"1	quel	quel	DET	DET	Gender=Masc|Number=Sing	5	det	0	_
4	<U+FFFD>	<U+FFFD>	SYM	SYM	_	5	compound	0	_
5	crivain	crivain\

As you can see, "écrivain" is split into "<U+FFFD>" (the Unicode replacement character, shown where the byte for "é" could not be decoded) and "crivain".

This encoding problem is driving me mad; if a brilliant mind could help me, I would be very grateful!

stevec · Accepted Answer

Set reserved = TRUE

i.e.

your_string <- "quel écrivain ?"

URLencode(your_string, reserved = TRUE)
# [1] "quel%20%C3%A9crivain%20%3F"
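
Note that reserved = TRUE is what escapes the "?". Whether "é" comes out as "%E9" or "%C3%A9" depends on the bytes R holds for the string: E9 is the Latin-1 byte, and C3 A9 is the UTF-8 byte pair. A minimal sketch, assuming the input may arrive in a non-UTF-8 native encoding (as can happen on Windows), is to convert it with enc2utf8() before encoding. The helper name encode_utf8_component is hypothetical, not part of any package.

```r
# Hypothetical helper: percent-encode a query value from its UTF-8 bytes.
encode_utf8_component <- function(x) {
  x <- enc2utf8(x)                      # make sure the bytes are UTF-8
  utils::URLencode(x, reserved = TRUE)  # also escape reserved characters such as "?"
}

# "\u00e9" is "é", written as an escape so the source file encoding does not matter.
s <- "quel \u00e9crivain ?"

# The same character has different bytes in the two encodings.
utf8_bytes   <- charToRaw(enc2utf8(s))[6:7]               # c3 a9
latin1_bytes <- charToRaw(iconv(s, "UTF-8", "latin1"))[6] # e9

encoded <- encode_utf8_component(s)
print(encoded)
# [1] "quel%20%C3%A9crivain%20%3F"
```

If the API still rejects the request, it is worth checking which encoding it expects; the sketch above only guarantees that the percent-escapes are built from UTF-8 bytes.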

Manuel Bickel · Answer

I do not think I am a brilliant mind, but I still have a possible solution for you. After URLencode(), it seems that your accented characters are converted into the last two hex digits of their Unicode code point, preceded by a %. To turn them back into readable characters, you can convert them into full Unicode escapes and use the stringi package to unescape them. For your single string, at least, this solution worked on my machine. I hope it also works for you.

Please note that I have introduced a % character at the end of your string to demonstrate that the gsub command below should work in any case.

You might have to adapt the replacement pattern \u00 to also cover Unicode code points that have something other than 0 in the first two positions, if this is relevant in your case.

library(stringi)
str <- "quel écrivain ?"
str <- URLencode(str)
# "quel%20%E9crivain%20?"
# Replacing % with a single backslash to get the Unicode escape directly
# does not work, since the backslash is itself an escape character; hence "\\".
str <- gsub("%", paste0("\\", "u00"), str, fixed = TRUE)
# [1] "quel\\u0020\\u00E9crivain\\u0020?"
# The printed double backslash stands for one literal backslash;
# stri_unescape_unicode() turns each \uXXXX escape into its character.
str <- stri_unescape_unicode(str)
# [1] "quel écrivain ?"
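
The \u00 trick works here because a Latin-1 byte such as E9 has the same value as the Unicode code point U+00E9. As a rough alternative sketch that needs only base R, and assuming the percent-escapes really are Latin-1 bytes, the string can be decoded with utils::URLdecode() and then declared as Latin-1 before converting to UTF-8. The variable names are illustrative.

```r
# Percent-encoded text whose escapes are assumed to be Latin-1 bytes.
encoded_latin1 <- "quel%20%E9crivain%20?"

# URLdecode() turns each %XX back into the raw byte XX.
decoded <- utils::URLdecode(encoded_latin1)

# Declare those bytes as Latin-1, then convert to UTF-8 for display.
Encoding(decoded) <- "latin1"
decoded <- enc2utf8(decoded)

print(decoded)
```

This only recovers a readable string from Latin-1 escapes; it does not by itself produce the "%C3%A9" form the question asks for.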

Tags:

post

parsing

r

encoding

get

Tau
