
Use data.table set() to convert all columns from integer to numeric

Tags: r, data.table

I am working with a data.table that has 1900 columns and roughly 280,000 rows.

Currently, every column is "integer", but I want the columns to be explicitly "numeric" so I can pass the data to a bigcor() function later. Apparently, bigcor() can only handle "numeric" and not "integer".

I have tried:

full.bind <- full.bind[,sapply(full.bind, as.numeric), with=FALSE]

Unfortunately, I get the error:

Error in `[.data.table`(full.bind, , sapply(full.bind, as.numeric), with = FALSE) : 
  j out of bounds

So, I tried using the data.table set() function, but I get the error:

Error in set(full.bind, value = as.numeric(full.bind)) : 
  (list) object cannot be coerced to type 'double'

I have created a simple reproducible example. Keep in mind that the actual columns are NOT "a", "b", or "c"; the real column names are extremely complicated, so referencing columns individually is not an option.

dt <- data.table(a=1:10, b=1:10, c=1:10)

So, my final questions are:

1) Why does my sapply technique not work? (What is the "j out of bounds" error?)
2) Why does the set() technique not work? (Why can't the data.table be coerced to numeric?)
3) Does the bigcor() function require a numeric object, or is there another problem?

Pablo Boswell asked Apr 22 '15 07:04


1 Answer

Use .SD and assignment by reference:

library(data.table)
dt <- data.table(a=1:10, b=1:10, c=1:10)
sapply(dt, class)
#        a         b         c 
#"integer" "integer" "integer"

dt[, names(dt) := lapply(.SD, as.numeric)]
sapply(dt, class)
#        a         b         c 
#"numeric" "numeric" "numeric"
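
If the table also held non-integer columns, the same idea could be restricted with .SDcols. The following is only a sketch under that assumption: dt_mixed and int_cols are hypothetical names and are not part of the question's data.

```r
library(data.table)

# Hypothetical mixed-type table (not from the question)
dt_mixed <- data.table(a = 1:3, b = c("x", "y", "z"), c = 4:6)

# Names of the columns that are currently stored as integer
int_cols <- names(dt_mixed)[sapply(dt_mixed, is.integer)]

# Convert only those columns by reference; .SD is limited to them via .SDcols
dt_mixed[, (int_cols) := lapply(.SD, as.numeric), .SDcols = int_cols]

sapply(dt_mixed, class)
```

The parentheses around int_cols make data.table treat it as a vector of column names rather than as a single new column called "int_cols".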

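Your set() call fails for two reasons: it omits j (note the documentation, which doesn't say that j is optional), and as.numeric() cannot coerce a whole data.table, which is a list, to double. Here set() has to be called once per column, because each replacement column has to be generated separately, so you would need to loop over the columns (e.g., using a for loop). This might be preferable because it needs less memory: the additional memory corresponds to one column at a time, whereas the first approach needs additional memory for the whole data.table. The example below uses as.character() only to make the change visible, since the columns are already numeric at this point; use as.numeric() for the original task.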

for (k in seq_along(dt)) set(dt, j = k, value = as.character(dt[[k]]))
sapply(dt, class)
#         a           b           c 
#"character" "character" "character"
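
For the conversion actually asked about, the same loop can be written with as.numeric(). This is a minimal sketch starting from a fresh integer table; dt_int is a hypothetical name chosen so it does not clash with dt above.

```r
library(data.table)

# Fresh all-integer table, mirroring the question's example
dt_int <- data.table(a = 1:10, b = 1:10, c = 1:10)

# Replace one column at a time by reference; j is given as a column position,
# so the (possibly complicated) column names are never typed out
for (k in seq_along(dt_int)) {
  set(dt_int, j = k, value = as.numeric(dt_int[[k]]))
}

sapply(dt_int, class)
```

Because the loop indexes columns by position, it should carry over unchanged to a table with 1900 columns.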

However, bigcor (from package propagate) requires a matrix as input, and a data.table isn't a matrix. So your problem is not the column type; you need to pass as.matrix(dt) instead.
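
A short sketch of the matrix conversion follows. It stops before calling bigcor, so nothing here depends on the propagate package. The storage.mode() step is an assumption on my part: it is only relevant if a double matrix turns out to be needed, since as.matrix() on all-integer columns keeps the integer type. The names dt_int and m are hypothetical.

```r
library(data.table)

# All-integer table, as in the question's example
dt_int <- data.table(a = 1:10, b = 1:10, c = 1:10)

# Convert the data.table to a plain matrix (column names are kept)
m <- as.matrix(dt_int)
typeof(m)

# Optional: switch the whole matrix to double in one step,
# without touching the data.table columns first
storage.mode(m) <- "double"
typeof(m)
```

The resulting m is what would be passed on in place of the data.table.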

Roland answered Oct 07 '22 16:10