
Removing non-ASCII characters from data files

I've got a bunch of CSV files that I'm reading into R and including in a package's data folder in .rdata format. Unfortunately, the non-ASCII characters in the data cause the package check to fail. The tools package has two functions to check for non-ASCII characters (showNonASCII and showNonASCIIfile), but I can't seem to locate one to remove or clean them.

Before I explore other UNIX tools, it would be great to do this all in R so I can maintain a complete workflow from raw data to final product. Are there any existing packages/functions to help me get rid of the non-ASCII characters?

Maiasaura asked Mar 29 '12 23:03




2 Answers

These days, a slightly better approach is to use the stringi package, which provides a function for general Unicode conversion. This allows you to preserve the original text as much as possible:

    x <- c("Ekstr\u00f8m", "J\u00f6reskog", "bi\u00dfchen Z\u00fcrcher")
    x
    #> [1] "Ekstrøm"         "Jöreskog"        "bißchen Zürcher"

    stringi::stri_trans_general(x, "latin-ascii")
    #> [1] "Ekstrom"          "Joreskog"         "bisschen Zurcher"
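Transliteration with "latin-ascii" only covers characters that have a Latin-to-ASCII mapping, so text in other scripts can still come through unchanged. A minimal sketch of one way to handle that, assuming the stringi package is installed: transliterate first, then let iconv() drop whatever is still non-ASCII. The helper name to_ascii is made up for this illustration.

```r
# Hypothetical helper (not part of stringi or base R):
# transliterate where possible, then drop whatever is still non-ASCII.
to_ascii <- function(x) {
  x <- enc2utf8(x)                                    # normalise input to UTF-8
  x <- stringi::stri_trans_general(x, "latin-ascii")  # e.g. "ö" -> "o", "ß" -> "ss"
  iconv(x, from = "UTF-8", to = "ASCII", sub = "")    # remove anything left over
}

# The second element mixes accented Latin letters with two CJK characters.
res <- to_ascii(c("J\u00f6reskog", "T\u014dky\u014d \u6771\u4eac"))

# Every element should now be pure ASCII.
all(stringi::stri_enc_isascii(res))
```

The order matters here: dropping first would lose letters such as "ö" that transliteration could have kept as "o".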
hadley answered Oct 21 '22 09:10



To simply remove the non-ASCII characters, you could use base R's iconv(), setting sub = "". Something like this should work:

    x <- c("Ekstr\xf8m", "J\xf6reskog", "bi\xdfchen Z\xfcrcher") # e.g. from ?iconv
    Encoding(x) <- "latin1"  # (just to make sure)
    x
    # [1] "Ekstrøm"         "Jöreskog"        "bißchen Zürcher"

    iconv(x, "latin1", "ASCII", sub="")
    # [1] "Ekstrm"        "Jreskog"       "bichen Zrcher"
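Since the data in the question come from CSV files, the same iconv() call can be applied to every character column of a data frame before saving it. The following is only a sketch in base R: the helper name strip_non_ascii, the example data frame and the output file name are made up, and it assumes the strings are latin1-encoded as in the example above.

```r
# Hypothetical helper: strip non-ASCII characters from every character column.
strip_non_ascii <- function(df, from = "latin1") {
  is_chr <- vapply(df, is.character, logical(1))
  df[is_chr] <- lapply(df[is_chr], iconv, from = from, to = "ASCII", sub = "")
  df
}

# Stand-in for a data frame read from one of the CSV files (made-up data).
nm <- c("Ekstr\xf8m", "J\xf6reskog")
Encoding(nm) <- "latin1"
authors <- data.frame(id = 1:2, name = nm, stringsAsFactors = FALSE)

clean <- strip_non_ascii(authors)
clean$name
# [1] "Ekstrm"  "Jreskog"

# Save the cleaned object in .rdata format (a temporary path, for illustration).
save(clean, file = file.path(tempdir(), "authors.rdata"))
```

Non-character columns such as id are left untouched; factor columns would need their levels cleaned separately.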

To locate non-ASCII characters, or to find if there were any at all in your files, you could likely adapt the following ideas:

    ## Do *any* lines contain non-ASCII characters?
    any(grepl("I_WAS_NOT_ASCII", iconv(x, "latin1", "ASCII", sub="I_WAS_NOT_ASCII")))
    # [1] TRUE

    ## Find which lines (e.g. read in by readLines()) contain non-ASCII characters
    grep("I_WAS_NOT_ASCII", iconv(x, "latin1", "ASCII", sub="I_WAS_NOT_ASCII"))
    # [1] 1 2 3
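A variation on the same idea that avoids a sentinel string, offered as a sketch rather than as part of the original answer: with the default sub = NA, iconv() returns NA for elements it cannot convert, so is.na() flags them directly. After cleaning, tools::showNonASCII() (mentioned in the question) can confirm that nothing is left. The example vector is made up.

```r
x <- c("plain ascii", "J\xf6reskog", "bi\xdfchen")
Encoding(x) <- "latin1"

# With sub = NA (the default), non-convertible elements become NA.
has_non_ascii <- is.na(iconv(x, "latin1", "ASCII"))
which(has_non_ascii)
# [1] 2 3

# Remove the offending characters, then re-check the result.
cleaned <- iconv(x, "latin1", "ASCII", sub = "")

# showNonASCII() prints any offending elements and returns them invisibly,
# so a zero-length result means the vector is clean.
leftover <- tools::showNonASCII(cleaned)
length(leftover) == 0
# [1] TRUE
```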
Josh O'Brien answered Oct 21 '22 09:10
