How to identify/delete non-UTF-8 characters in R

When I import a Stata dataset in R (using the foreign package), the import sometimes contains characters that are not valid UTF-8. This is unpleasant enough by itself, but it breaks everything as soon as I try to transform the object to JSON (using the rjson package).

How can I identify invalid UTF-8 characters in a string and then delete them?

239

asked Jun 25 '13 07:06

Marcel Hebing

2 Answers

Another solution uses iconv and its sub argument, a character string. If it is not NA (here I set it to ''), it is used to replace any non-convertible bytes in the input.

    x <- "fa\xE7ile"
    Encoding(x) <- "UTF-8"
    iconv(x, "UTF-8", "UTF-8", sub = '')  ## replace any non-UTF-8 bytes with ''
    ## "faile"

Note that if we choose the right source encoding, the byte is converted to the proper character instead of being dropped:

    x <- "fa\xE7ile"
    Encoding(x) <- "latin1"
    xx <- iconv(x, "latin1", "UTF-8", sub = '')
    xx
    ## "façile"
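Since the question is about an imported dataset rather than a single string, the same iconv call can be applied column by column. The following is only a sketch: strip_invalid_utf8 is a hypothetical helper name, and the small data frame d is a made-up stand-in for the result of a Stata import. It assumes the text is supposed to be UTF-8, so any offending bytes are simply dropped.

```r
# Hypothetical helper (not part of any package): remove bytes that are not
# valid UTF-8 from every character column of a data frame.
# Non-character columns (numbers, factors) are left untouched.
strip_invalid_utf8 <- function(df) {
  is_chr <- vapply(df, is.character, logical(1))
  df[is_chr] <- lapply(df[is_chr], iconv,
                       from = "UTF-8", to = "UTF-8", sub = "")
  df
}

# Toy data standing in for an imported Stata file (assumption for illustration)
d <- data.frame(id = 1:2,
                word = c("fa\xE7ile", "plain"),
                stringsAsFactors = FALSE)

clean <- strip_invalid_utf8(d)
clean$word  # the stray byte is gone, as in the single-string example above
```

If the file really is latin1, passing from = "latin1" instead would keep the accented characters rather than delete them.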

answered Oct 02 '22 00:10

agstudy

Yihui's xfun package has a function, read_utf8, that reads a file on the assumption that it is encoded as UTF-8. If the file contains non-UTF-8 lines, it triggers a warning telling you which line(s) contain non-UTF-8 characters. Under the hood it uses a non-exported function, xfun:::invalid_utf8(), which is simply the following: which(!is.na(x) & is.na(iconv(x, "UTF-8", "UTF-8"))).

To detect specific non-UTF-8 words in a string, you could modify the above slightly and do something like:

    # TRUE for elements that are not NA but cannot be read as UTF-8
    invalid_utf8_ <- function(x) {
      !is.na(x) & is.na(iconv(x, "UTF-8", "UTF-8"))
    }

    # Split the string on `separator` and report the invalid pieces
    detect_invalid_utf8 <- function(string, separator) {
      stringSplit <- unlist(strsplit(string, separator))
      invalidIndex <- unlist(lapply(stringSplit, invalid_utf8_))
      data.frame(
        word = stringSplit[invalidIndex],
        stringIndex = which(invalidIndex)
      )
    }

    x <- "This is a string fa\xE7ile blah blah blah fa\xE7ade"
    detect_invalid_utf8(x, " ")
    #     word stringIndex
    # 1 façile           5
    # 2 façade           9
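The element-wise check can also be written with validUTF8(), which ships with base R (3.3.0 and later, as far as I know) and so needs no extra package. This is a minimal sketch on a made-up vector x; the variable name bad is just for illustration.

```r
# Made-up input: one element with a stray latin1 byte, one plain ASCII,
# one missing value, and one correctly encoded UTF-8 string.
x <- c("fa\xE7ile", "facile", NA, "fa\u00E7ile")

# Positions of elements that are present but not valid UTF-8
bad <- which(!is.na(x) & !validUTF8(x))
bad
```

Only the first element should be flagged here; the positions in bad can then be inspected by hand or cleaned with iconv(..., sub = '') as in the other answer.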

answered Oct 02 '22 00:10

conrad-mac

Tags: r, utf-8, stata
