
How to identify/delete non-UTF-8 characters in R

Tags:

r

utf-8

stata

When I import a Stata dataset in R (using the foreign package), the import sometimes contains characters that are not valid UTF-8. This is unpleasant enough by itself, but it breaks everything as soon as I try to transform the object to JSON (using the rjson package).

How can I identify invalid UTF-8 characters in a string and then delete them?

Marcel Hebing asked Jun 25 '13 07:06



2 Answers

Another solution uses iconv and its sub argument, a character string. If sub is not NA (here I set it to ''), it is used to replace any non-convertible bytes in the input.

    x <- "fa\xE7ile"
    Encoding(x) <- "UTF-8"
    iconv(x, "UTF-8", "UTF-8", sub = '')  ## replace any non-UTF-8 bytes by ''
    ## "faile"

Note that if we declare the correct source encoding instead (latin1 here), the byte is converted rather than dropped:

    x <- "fa\xE7ile"
    Encoding(x) <- "latin1"
    xx <- iconv(x, "latin1", "UTF-8", sub = '')
    xx
    ## "façile"
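Since the question is about a whole imported dataset rather than a single string, the same iconv call can be applied to every character column. The following is only a minimal sketch: the helper name strip_invalid_utf8 and the toy data frame are hypothetical, and it assumes the text columns are character vectors (factor levels, which imports may produce, are not handled here).

```r
# Hypothetical helper (not part of any package): drop bytes that are not
# valid UTF-8 from every character column of a data frame.
strip_invalid_utf8 <- function(df) {
  is_chr <- vapply(df, is.character, logical(1))
  df[is_chr] <- lapply(df[is_chr], function(col) {
    # sub = "" removes non-convertible bytes instead of returning NA
    iconv(col, from = "UTF-8", to = "UTF-8", sub = "")
  })
  df
}

# Toy data standing in for an imported Stata dataset (an assumption)
d <- data.frame(
  id = 1:2,
  label = c("fa\xE7ile", "plain"),
  stringsAsFactors = FALSE
)

clean <- strip_invalid_utf8(d)
clean$label
```

Non-character columns are left untouched. If the true source encoding is known (for example latin1), passing it as from would convert the characters instead of deleting them.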
agstudy answered Oct 02 '22 00:10


Yihui's xfun package has a function, read_utf8, that reads a file on the assumption that it is encoded as UTF-8. If the file contains non-UTF-8 lines, it triggers a warning telling you which line(s) contain non-UTF-8 characters. Under the hood it uses a non-exported function, xfun:::invalid_utf8(), which is simply the following: which(!is.na(x) & is.na(iconv(x, "UTF-8", "UTF-8"))).
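As a small illustration of that check on a character vector, the sketch below runs the quoted expression on made-up input and compares it with base R's validUTF8(), which is available in R 3.3.0 and later. The example vector is an assumption, and the iconv-based result relies on the platform's iconv returning NA for invalid input.

```r
# Made-up input: one plain ASCII string, one with an invalid byte,
# and one valid UTF-8 string containing a non-ASCII character.
x <- c("plain", "fa\xE7ile", "caf\u00e9")

# The expression quoted from xfun:::invalid_utf8(); this depends on
# iconv() returning NA for input that is not valid UTF-8.
bad_iconv <- which(!is.na(x) & is.na(iconv(x, "UTF-8", "UTF-8")))

# Base R alternative that does not go through iconv()
bad_base <- which(!validUTF8(x))

bad_base  # index of the invalid element: 2
```

Both approaches flag whole elements, not individual bytes, so they tell you which strings need cleaning but not where in the string the problem is.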

To detect specific non-UTF-8 words in a string, you could modify the above slightly and do something like:

    invalid_utf8_ <- function(x) {
      !is.na(x) & is.na(iconv(x, "UTF-8", "UTF-8"))
    }

    detect_invalid_utf8 <- function(string, separator) {
      stringSplit <- unlist(strsplit(string, separator))
      invalidIndex <- unlist(lapply(stringSplit, invalid_utf8_))
      data.frame(
        word = stringSplit[invalidIndex],
        stringIndex = which(invalidIndex == TRUE)
      )
    }

    x <- "This is a string fa\xE7ile blah blah blah fa\xE7ade"

    detect_invalid_utf8(x, " ")

    #     word stringIndex
    # 1 façile           5
    # 2 façade           9
conrad-mac answered Oct 02 '22 00:10