I have exported data from a result grid in SQL Server Management Studio to a csv file. The csv file looks correct.
But when I read the data into an R dataframe using read.csv, the first column name is prepended with "ï..". How do I get rid of this junk text?
Example:
str(trainData)
'data.frame': 64169 obs. of 20 variables:
 $ ï..Column1 : int 3232...
 $ Column2 : int 4242...
The data looks something like this (nothing special) :
Column1,Column2
100116577,100116577
100116698,100116702
It is the byte order mark (BOM), which signals that the text that follows is encoded in Unicode. A text editor that reads the file with a different encoding may display the mark as other characters, namely ï»¿.
To load a .csv file into the current script and work with it, use the read.csv() function in base R. The result is returned as a data frame, with rows numbered by integers starting at 1.
You've got a Unicode UTF-8 BOM at the start of the file:
http://en.wikipedia.org/wiki/Byte_order_mark
A text editor or web browser interpreting the text as ISO-8859-1 or CP1252 will display the characters ï»¿ for this.
R is giving you the ï and then converting the other two into dots as they are non-alphanumeric characters.
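To check whether a file really starts with a BOM, you can look at its first three bytes. The sketch below is only an illustration: it writes a small temporary file (made up here to mimic the export in the question) that begins with the UTF-8 BOM bytes EF BB BF, then reads those bytes back with readBin.

```r
# Build a throwaway CSV that starts with the UTF-8 BOM (EF BB BF).
# The file contents are illustrative, modelled on the question's data.
path <- tempfile(fileext = ".csv")
con <- file(path, "wb")
writeBin(as.raw(c(0xEF, 0xBB, 0xBF)), con)
writeBin(charToRaw("Column1,Column2\n100116577,100116577\n"), con)
close(con)

# Read the first three bytes as raw values and compare them to the BOM.
first_bytes <- readBin(path, what = "raw", n = 3)
print(first_bytes)
has_bom <- identical(first_bytes, as.raw(c(0xEF, 0xBB, 0xBF)))
print(has_bom)
```

Pointing readBin at your own exported file instead of the temporary one should tell you whether the BOM is what you are seeing.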
Here:
http://r.789695.n4.nabble.com/Writing-Unicode-Text-into-Text-File-from-R-in-Windows-td4684693.html
Duncan Murdoch suggests:
You can declare a file to be in encoding "UTF-8-BOM" if you want to ignore a BOM on input
So try your read.csv
with fileEncoding="UTF-8-BOM"
or persuade your SQL wotsit to not output a BOM.
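A minimal sketch of the fileEncoding approach, assuming a file that starts with a UTF-8 BOM. The temporary file here stands in for the SSMS export and is built by hand so the example is self-contained; trainData is the name used in the question.

```r
# Create a stand-in for the exported file: UTF-8 BOM followed by CSV text.
path <- tempfile(fileext = ".csv")
con <- file(path, "wb")
writeBin(as.raw(c(0xEF, 0xBB, 0xBF)), con)
writeBin(charToRaw("Column1,Column2\n100116577,100116577\n100116698,100116702\n"), con)
close(con)

# Declaring the encoding as "UTF-8-BOM" makes R skip the BOM on input,
# so the first column name comes through without the junk prefix.
trainData <- read.csv(path, fileEncoding = "UTF-8-BOM")
print(names(trainData))
str(trainData)
```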
Otherwise you may as well test if the first name starts with ï..
and strip it with substr
(as long as you know you'll never have a column that does start like that genuinely...)
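The strip-it-afterwards fallback could look roughly like this. strip_bom_prefix is a hypothetical helper name, not a base R function, and it assumes the prefix was mangled to exactly "ï.." as in the question (written as "\u00ef.." so the script does not depend on its own file encoding).

```r
# Hypothetical helper: drop a leading "ï.." from the first column name only.
strip_bom_prefix <- function(df, prefix = "\u00ef..") {
  first <- names(df)[1]
  if (startsWith(first, prefix)) {
    # Keep everything after the prefix.
    names(df)[1] <- substr(first, nchar(prefix) + 1, nchar(first))
  }
  df
}

# Illustrative data frame with the mangled name from the question.
df <- data.frame(a = 1:2, b = 3:4)
names(df) <- c("\u00ef..Column1", "Column2")

df <- strip_bom_prefix(df)
print(names(df))
```

As the caveat above says, this would also rename a column whose real name starts with "ï..", so reading with fileEncoding="UTF-8-BOM" is usually the safer choice.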