
Remove non-printable white spaces from unknown (to me) encoding

So I've parsed an XML file using the R XML package with the following code:

library(XML) 
data <- xmlToDataFrame(xmlParse("Some file I can't share.xml"))

Everything worked fine and I got the expected result:

dim(data)
## [1] 554560 13

The only problem is that some of my entries look as follows:

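x <- "2  \002\003\004\003\005\005\006\005\002\003\004\003\005\005\006\005irfl014"  # the control characters are invisible when pasted as raw text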
x
## [1] "2  \002\003\004\003\005\005\006\005\002\003\004\003\005\005\006\005irfl014"

I tried to identify the encoding, with no success:

Encoding(x)
## [1] "unknown"

library(stringi)
stri_enc_detect(x)
# [[1]]
# [[1]]$Encoding
# [1] "UTF-8"     "Shift_JIS" "GB18030"   "EUC-JP"    "EUC-KR"    "Big5"     
# 
# [[1]]$Language
# [1] ""   "ja" "zh" "ja" "ko" "zh"
# 
# [[1]]$Confidence
# [1] 0.1 0.1 0.1 0.1 0.1 0.1

Encodings aren't my strongest field of expertise. Is there a simple way to convert x to just the following?

x
## [1] "2  irfl014"
David Arenburg asked Feb 10 '15 13:02


1 Answer

x <- "2  \002\003\004\003\005\005\006\005\002\003\004\003\005\005\006\005irfl014"

cat(x)
# 2  irfl014

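The special characters, e.g., "\002", are non-printable control characters. print() displays them as octal escapes, whereas cat() writes them to the console unchanged, where they are invisible. See here for more information.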
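To see exactly which characters are involved, one option is to look at the code points. The following is a minimal sketch using base R only; the variable names codes and is_ctrl are just illustrative. It also suggests why Encoding(x) reports "unknown": the string is pure ASCII, and R does not mark ASCII strings with a declared encoding.

```r
x <- "2  \002\003\004\003\005\005\006\005\002\003\004\003\005\005\006\005irfl014"

# Integer code point of every character in the string
codes <- utf8ToInt(x)
codes

# ASCII control characters are code points below 32, plus 127 (DEL)
is_ctrl <- codes < 32 | codes == 127
which(is_ctrl)

# TRUE when every character is plain ASCII
all(codes < 128)
```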

You can use the following gsub command to remove all control characters:

gsub("[[:cntrl:]]+", "", x)
# [1] "2  irfl014"
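Since the question concerns a whole data frame rather than a single string, the same gsub call can be applied to every character column. This is only a sketch: df below is a small hypothetical stand-in for the parsed data, and it assumes the affected columns are character vectors rather than factors. Note that [[:cntrl:]] also matches tabs and newlines, so those would be removed as well.

```r
# Hypothetical stand-in for the data frame returned by xmlToDataFrame()
df <- data.frame(
  id = c("2  \002\003irfl014", "3  abc"),
  n  = c(1, 2),
  stringsAsFactors = FALSE
)

# Identify character columns; numeric and other columns are left untouched
is_chr <- vapply(df, is.character, logical(1))

# Strip control characters from each character column
df[is_chr] <- lapply(df[is_chr], function(col) gsub("[[:cntrl:]]+", "", col))

df$id
```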
Sven Hohenstein answered Sep 25 '22 13:09