
Google Docs exports spreadsheet values with commas. read.csv() in R treats these as factors instead of numeric

Tags: r, csv, google-docs

I am new to R and am trying to read a public Google spreadsheet into an R data frame with numeric columns. My problem seems to be that the exported spreadsheet has commas in large numbers, such as "13,061.422". The read.csv() function reads such columns as factors. I tried stringsAsFactors=FALSE and colClasses=c(rep("numeric",7)) but neither worked. Is there a way to coerce the values with commas and decimals to numeric values, either within read.csv() or afterwards, once they are factors in the R data frame? Here is my code:

require(RCurl)
myCsv <- getURL("https://docs.google.com/spreadsheet/pub?hl=en_US&hl=en_US&key=0Agbdciapt4QZdE95UDFoNHlyNnl6aGlqbGF0cDIzTlE&single=true&gid=0&range=A1%3AG4928&output=csv", ssl.verifypeer=FALSE)  #ssl.verifypeer=FALSE gets around certificate issues I don't understand.
fullmatrix <- read.csv(textConnection(myCsv))
str(fullmatrix)

which results in:

'data.frame':   4927 obs. of  7 variables:
 $ wave.      : Factor w/ 4927 levels "1,000.8900","1,002.8190",..: 4875 4874 4873 4872 4871 4870 4869 4868 4867 4866 ...
 $ wavelength : Factor w/ 4927 levels "1,000.074","1,000.267",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ d2o        : num  85.2 87.7 86.3 87.6 85.6 ...
 $ di         : num  54.3 55.8 54.9 55.6 54.9 ...
 $ ddw        : num  48.2 49.7 49.4 50.2 49.6 ...
 $ ddw.old    : num  53.3 55 53.9 54.8 53.7 ...
 $ d2o.ddw.mix: num  65.8 67.9 67.2 68.4 66.8 ...

Thanks for any help! I am new to R, so guessing (hoping) this is an easy one!

Steve Koch asked Dec 04 '22 18:12


2 Answers

Yes. Two methods. The easiest to understand at first is probably just to use as.is=TRUE to preserve them as character vectors and then use gsub to remove the commas and any currency symbols before converting to numeric. The second is a bit more difficult, but I think more kewl. Create an as-method for the format you are using. Then you can use colClasses to do it in one step.

I see @EDi already did version #1 (using stringsAsFactors rather than as.is), so I will document strategy #2:

library(methods)
setClass("num.with.commas")
#[1] "num.with.commas"
setAs("character", "num.with.commas",
      function(from) as.numeric(gsub(",", "", from)))
require(RCurl)
#Loading required package: RCurl
#Loading required package: bitops

myCsv <- getURL("https://docs.google.com/spreadsheet/pub?hl=en_US&hl=en_US&key=0Agbdciapt4QZdE95UDFoNHlyNnl6aGlqbGF0cDIzTlE&single=true&gid=0&range=A1%3AG4928&output=csv", ssl.verifypeer=FALSE)
# Columns 1-2 contain commas; the remaining five are plain numerics (7 columns in total)
fullmatrix <- read.csv(textConnection(myCsv),
                       colClasses=c(rep("num.with.commas",2), rep("numeric",5)))
str(fullmatrix)
#--------------
'data.frame':   4927 obs. of  7 variables:
 $ wave.      : num  9999 9997 9995 9993 9992 ...
 $ wavelength : num  1000 1000 1000 1001 1001 ...
 $ d2o        : num  85.2 87.7 86.3 87.6 85.6 ...
 $ di         : num  54.3 55.8 54.9 55.6 54.9 ...
 $ ddw        : num  48.2 49.7 49.4 50.2 49.6 ...
 $ ddw.old    : num  53.3 55 53.9 54.8 53.7 ...
 $ d2o.ddw.mix: num  65.8 67.9 67.2 68.4 66.8 ...

As-methods perform coercion. Base R has many of them, such as as.list, as.numeric, and as.character. Each takes input in one mode and attempts to make a sensible copy of it in a different mode. For instance, it makes sense to coerce a matrix to a data frame because both have two dimensions. It makes a bit less sense to coerce a data frame to a matrix (although it does succeed, with loss of all the column attributes and coercion to a common mode).
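To make the matrix and data frame remark concrete, here is a minimal sketch. The objects m, df and mat are made up for illustration and have nothing to do with the spreadsheet data:

```r
# Hypothetical toy objects, only to illustrate coercion between structures
m  <- matrix(1:4, nrow = 2)
df <- data.frame(x = c(1.5, 2.5), label = c("a", "b"),
                 stringsAsFactors = FALSE)

# matrix -> data frame: each column keeps its numeric type
str(as.data.frame(m))

# data frame -> matrix: mixed columns are forced to one common mode
mat <- as.matrix(df)
mode(mat)
```

Because the label column is character, the whole matrix ends up as character, which is the "coercion to a common mode" mentioned above.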

In the present case I am taking a character string as input, removing any commas, and coercing the character values to numeric. Then I use the colClasses argument of read.table (in this case by way of read.csv) to dispatch to the as-method I registered with setAs. See help(setAs) for more details. The S4 class system confuses a lot of people, me included. This is about the only area of success I have had with S4 methods.
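If you want to try the mechanism without downloading anything, the following sketch should behave the same way on a small inline CSV. The csv_text string and the demo data frame are invented stand-ins for the real spreadsheet, and the text= argument of read.csv is used in place of textConnection:

```r
library(methods)

# Register the class name and the character -> "num.with.commas" coercion
setClass("num.with.commas")
setAs("character", "num.with.commas",
      function(from) as.numeric(gsub(",", "", from)))

# Hypothetical inline data standing in for the downloaded CSV
csv_text <- 'wave,wavelength,d2o
"9,999.2590","1,000.074",85.2
"9,997.3300","1,000.267",87.7'

# read.csv reads the first two columns as character, then calls the as-method
demo <- read.csv(text = csv_text,
                 colClasses = c("num.with.commas", "num.with.commas", "numeric"))
str(demo)
```

All three columns should come back as num, with the commas already stripped from the first two.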

IRTFM answered Dec 07 '22 09:12


Read the data in with stringsAsFactors = FALSE, remove the commas (with gsub()) and convert to numeric (with as.numeric()):

> fullmatrix <- read.csv(textConnection(myCsv), stringsAsFactors = FALSE)

> str(fullmatrix)
'data.frame':   4927 obs. of  7 variables:
 $ wave.      : chr  "9,999.2590" "9,997.3300" "9,995.4010" "9,993.4730" ...
 $ wavelength : chr  "1,000.07410549122" "1,000.26707130804" "1,000.46011160533" "1,000.65312629553" ...
 $ d2o        : num  85.2 87.7 86.3 87.6 85.6 ...
 $ di         : num  54.3 55.8 54.9 55.6 54.9 ...
 $ ddw        : num  48.2 49.7 49.4 50.2 49.6 ...
 $ ddw.old    : num  53.3 55 53.9 54.8 53.7 ...
 $ d2o.ddw.mix: num  65.8 67.9 67.2 68.4 66.8 ...

> fullmatrix$wave. <- as.numeric(gsub(",", "", fullmatrix$wave.)) 
> fullmatrix$wavelength <- as.numeric(gsub(",", "", fullmatrix$wavelength))

> str(fullmatrix)
'data.frame':   4927 obs. of  7 variables:
 $ wave.      : num  9999 9997 9995 9993 9992 ...
 $ wavelength : num  1000 1000 1000 1001 1001 ...
 $ d2o        : num  85.2 87.7 86.3 87.6 85.6 ...
 $ di         : num  54.3 55.8 54.9 55.6 54.9 ...
 $ ddw        : num  48.2 49.7 49.4 50.2 49.6 ...
 $ ddw.old    : num  53.3 55 53.9 54.8 53.7 ...
 $ d2o.ddw.mix: num  65.8 67.9 67.2 68.4 66.8 ...

> fullmatrix[1, 1]
[1] 9999.259
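
If more than a couple of columns are affected, the same gsub() and as.numeric() step can be applied to every character column at once. This is only a sketch under assumptions: csv_text and the helper name strip_commas are made up for illustration, and it presumes that every character column really holds comma-formatted numbers.

```r
# Hypothetical helper: drop thousands separators, then convert to numeric
strip_commas <- function(x) as.numeric(gsub(",", "", x))

# Hypothetical inline data standing in for the downloaded CSV
csv_text <- 'wave,wavelength,d2o
"9,999.2590","1,000.074",85.2
"9,997.3300","1,000.267",87.7'

dat <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Find the columns that were read as character and convert only those
is_chr <- vapply(dat, is.character, logical(1))
dat[is_chr] <- lapply(dat[is_chr], strip_commas)
str(dat)
```

Columns that were already numeric are left untouched. A character column holding real text would turn into NA with a warning, so check is_chr before converting.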
EDi answered Dec 07 '22 08:12