Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Importing data with special characters in R

The following pic shows how the data is before i import it(notepad) in R and after importing.

enter image description here

I use the following command to import it in R:

Data <- read.csv('data.csv',stringsAsFactors = FALSE,header = TRUE,quote = "")

It can be seen that the special characters such as the ae is replaced with something like A| (line 19 on the left,line 18 or the right). Is there a way to import the CSV file as it is? (Using R)

like image 313
Mpizos Dimitris Avatar asked Nov 13 '15 14:11

Mpizos Dimitris


1 Answers

Your problem is an encoding issue. There are two aspects to this: First, what is saved by Notepad++ may not correspond to the encoding that you are expecting in the saved text file, and second, R may be reading the file in using read.csv() based on a different encoding, which is especially possible since if you are using Notepad++ then this suggests you are using Windows, and therefore you may be unable to have UTF-8 as your system locale for R.

So taking each issue in turn:

  1. Getting Notepad++ to save your file in a specific encoding. Here you can set your encoding for the new file based using these instructions. I always use UTF-8 but here since your texts are Danish, Latin-1 should work too.

    To verify the encoding of your texts, you may wish to use the file utility supplied with RTools. This will tell you something about the probable encoding of your file from the command line, although it is not perfect. (OS X and Linux users already have this without needing to install additional utilities.)

  2. Setting encoding when importing the .csv file into R. When you import the file using read.csv(), specify encoding = "UTF-8" or encoding = "Latin-1". You might also want to check though what your system encoding is, and match that. You can do this with Sys.getlocale() (and set it with Sys.setlocale().) On my system for instance:

    > Sys.getlocale()
    [1] "en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8"
    

    You could of course set this to Windows-1252 but you might have trouble then with portability if using this on other platforms. UTF-8 is the best solution to this.

like image 194
Ken Benoit Avatar answered Nov 15 '22 17:11

Ken Benoit