I've parsed an XML file with the R XML package, using the following code:
library(XML)
data <- xmlToDataFrame(xmlParse("Some file I can't share.xml"))
Everything worked fine and I got the expected result:
dim(data)
## [1] 554560 13
The only problem is that some of my entries look as follows:
x <- "2 \002\003\004\003\005\005\006\005\002\003\004\003\005\005\006\005irfl014"
x
## [1] "2 \002\003\004\003\005\005\006\005\002\003\004\003\005\005\006\005irfl014"
I tried to identify the encoding, without success:
Encoding(x)
## [1] "unknown"
library(stringi)
stri_enc_detect(x)
# [[1]]
# [[1]]$Encoding
# [1] "UTF-8" "Shift_JIS" "GB18030" "EUC-JP" "EUC-KR" "Big5"
#
# [[1]]$Language
# [1] "" "ja" "zh" "ja" "ko" "zh"
#
# [[1]]$Confidence
# [1] 0.1 0.1 0.1 0.1 0.1 0.1
Encoding isn't my strongest field of expertise. Is there a simple way to convert x to just the following?
x
## [1] "2 irfl014"
x <- "2 \002\003\004\003\005\005\006\005\002\003\004\003\005\005\006\005irfl014"
cat(x)
# 2 irfl014
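The special characters, e.g., "\002", are non-printable control characters. Each one is written as an octal escape, so "\002" is the single character with code 2. That is why cat(x) shows nothing between "2 " and "irfl014".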
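To confirm that the stray characters really are control characters and not an encoding problem, one option is to look at the integer code of each character. This is only a sketch for the example string above; the variable names codes and is_ctrl are illustrative, and it assumes the string is otherwise plain ASCII.

```r
# Example string from the question, written with octal escapes
x <- "2 \002\003\004\003\005\005\006\005\002\003\004\003\005\005\006\005irfl014"

# Integer code of every character in the string
codes <- utf8ToInt(x)

# ASCII control characters have codes 0-31, plus 127 (DEL)
is_ctrl <- codes < 32 | codes == 127

sum(is_ctrl)            # how many control characters the string holds
# [1] 16
unique(codes[is_ctrl])  # which control codes occur
# [1] 2 3 4 5 6
```

All of the unwanted characters fall in the control range, so removing that character class should be enough here.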
You can use the following gsub
command to remove all control characters:
gsub("[[:cntrl:]]+", "", x)
# [1] "2 irfl014"
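Since the problem affects entries of a whole data frame, the same gsub call could be applied to every character column. The sketch below uses a small made-up data frame, df, in place of the real data, and a hypothetical helper named strip_cntrl. It assumes the affected columns are character vectors; if xmlToDataFrame returned factors, they would need as.character() first.

```r
# Made-up stand-in for the parsed data (the real file is not available)
df <- data.frame(
  id   = c("2 \002\003irfl014", "3 abc"),
  note = c("ok", "x\005y"),
  n    = c(1, 2),
  stringsAsFactors = FALSE
)

# Illustrative helper: drop every run of control characters
strip_cntrl <- function(v) gsub("[[:cntrl:]]+", "", v)

# Find the character columns and clean only those
chr_cols <- vapply(df, is.character, logical(1))
df[chr_cols] <- lapply(df[chr_cols], strip_cntrl)

df$id
# [1] "2 irfl014" "3 abc"
```

Non-character columns such as n are left untouched, so numeric data is not coerced to text.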