convert string with hidden characters

Tags: string, r, raw

I have some character strings that I'm getting from an HTML page. It turns out these strings contain some hidden or control characters (?).

How can I convert this string so that it only contains the visible characters?

Take for example the term "Besucherüberblick" and its raw representation:

charToRaw("Besucherüberblick")
 [1] 42 65 73 75 63 68 65 72 c3 bc 62 65 72 62 6c 69 63 6b

However, from my HTML, I'm getting:

[1] e2 80 8c 42 65 73 75 63 68 65 72 c3 bc 62 65 72 62 6c 69 63 6b

So there are these three weird thingies at the beginning.

I could probably use trial and error to manually remove these bytes from my raw vector and then convert it back to character, but a) I don't know in advance which strings the HTML will give me and b) I'm looking for an automated solution.

Maybe there's some stringr/stringi solution to it?

deschen asked Mar 07 '26 10:03


1 Answer

Those first three bytes (e2 80 8c) are the UTF-8 encoding of the zero-width non-joiner Unicode character (U+200C). You can remove it, along with all other invisible formatting characters, using the \p{Format} regular expression class, which should contain the invisible formatting indicators. It is one of several Unicode general categories that can be used as regex classes, and it covers roughly 160 characters.
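To check which code point a run of unexpected bytes encodes, a minimal base R sketch like the following may help. The variable names `prefix` and `cp` are only illustrative, and marking the encoding explicitly is a precaution rather than something the original answer does.

```r
# The three unexpected bytes reported in the question
prefix <- rawToChar(as.raw(c(0xe2, 0x80, 0x8c)))
Encoding(prefix) <- "UTF-8"  # mark the bytes as UTF-8 text

# utf8ToInt() returns the Unicode code point(s) as integers
cp <- utf8ToInt(prefix)
sprintf("U+%04X", cp)
# [1] "U+200C"
```

U+200C is the zero-width non-joiner, which is why nothing visible shows up when the string is printed.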

x <- rawToChar(as.raw(c(226, 128, 140, 66, 101, 115, 117, 99, 104, 101, 114, 195, 188, 
      98, 101, 114, 98, 108, 105, 99, 107)))
x
# [1] "‌Besucherüberblick"
charToRaw(x)
#  [1] e2 80 8c 42 65 73 75 63 68 65 72 c3 bc 62 65 72 62 6c 69 63 6b


y <- stringr::str_remove_all(x, "[\\p{Format}]") 
y
# [1] "Besucherüberblick"
charToRaw(y)
#  [1] 42 65 73 75 63 68 65 72 c3 bc 62 65 72 62 6c 69 63 6b
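Since the question asks for an automated approach for strings that are not known in advance, here is a hedged sketch of the same idea applied to a whole character vector, using stringi instead of stringr. The vector `scraped` and its contents are made up for illustration; "\u200c" (zero-width non-joiner) and "\u200b" (zero-width space) stand in for whatever hidden characters the HTML might contain.

```r
library(stringi)

# Hypothetical scraped values, two of them with an invisible format character
scraped <- c("\u200cBesucher\u00fcberblick", "Seiten\u200baufrufe", "Sitzungen")

# stri_replace_all_regex() is vectorized, so every element is cleaned at once
cleaned <- stri_replace_all_regex(scraped, "\\p{Format}", "")

# Flag which inputs contained at least one format character
had_hidden <- stri_detect_regex(scraped, "\\p{Format}")
had_hidden
# [1]  TRUE  TRUE FALSE
```

Keeping a flag such as `had_hidden` is optional, but it can be handy for spotting which source strings were affected.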

Another good choice might be \p{Other} if you also want to remove other control characters, unassigned code points, etc. It matches all of the following categories: \p{Control} (an ASCII or Latin-1 control character: 0x00–0x1F and 0x7F–0x9F, which includes things like tabs and newline characters), \p{Format} (invisible formatting indicator), \p{Private_Use} (any code point reserved for private use), \p{Surrogate} (one half of a surrogate pair in UTF-16 encoding) and \p{Unassigned} (any code point to which no character has been assigned). Keep in mind that this also strips tabs and newlines.
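The difference between the two classes can be seen in a small sketch. The test string `z` is invented for illustration: it has a zero-width non-joiner at the start, a tab in the middle and a trailing newline.

```r
library(stringr)

# Illustrative input: ZWNJ + "Besucher" + tab + "überblick" + newline
z <- "\u200cBesucher\t\u00fcberblick\n"

only_format <- str_remove_all(z, "\\p{Format}")  # drops the ZWNJ, keeps tab and newline
all_other   <- str_remove_all(z, "\\p{Other}")   # also drops the tab and the newline

only_format
# [1] "Besucher\tüberblick\n"
all_other
# [1] "Besucherüberblick"
```

If the whitespace in the scraped text matters, \p{Format} is probably the safer choice of the two.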

MrFlick answered Mar 09 '26 03:03
