I have what I think is a very simple question related to the use of data.table and the := function. I don't think I quite understand the behaviour of :=, and I often run into similar problems.
Here is some example data:
mat <- structure(list(
    col1 = c(NA, 0, -0.015038, 0.003817, -0.011407),
    col2 = c(0.003745, 0.007463, -0.007407, -0.003731, -0.007491)),
  .Names = c("col1", "col2"),
  row.names = c(NA, -5L),
  class = c("data.table", "data.frame"))
which gives
> mat
        col1      col2
1:        NA  0.003745
2:  0.000000  0.007463
3: -0.015038 -0.007407
4:  0.003817 -0.003731
5: -0.011407 -0.007491
I want to create a column called col3 which gives the sum of col1 and col2. If I use
mat[, col3 := col1 + col2]
#        col1      col2      col3
#1:        NA  0.003745        NA
#2:  0.000000  0.007463  0.007463
#3: -0.015038 -0.007407 -0.022445
#4:  0.003817 -0.003731  0.000086
#5: -0.011407 -0.007491 -0.018898
then I get an NA for the first row, but I want NAs to be ignored. So I tried instead
mat[, col3 := sum(col1, col2, na.rm = TRUE)]
#        col1      col2      col3
#1:        NA  0.003745 -0.030049
#2:  0.000000  0.007463 -0.030049
#3: -0.015038 -0.007407 -0.030049
#4:  0.003817 -0.003731 -0.030049
#5: -0.011407 -0.007491 -0.030049
which is not what I am after, since it gives me the sum of all elements of col1 and col2. I think I don't quite get := ... How can I get the row-by-row sum of col1 and col2, ignoring NA values?
Not sure this is relevant, but here is my sessionInfo:
> sessionInfo()
R version 2.15.1 (2012-06-22)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)

locale:
[1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] data.table_1.8.3
To find the sum of the non-missing values in an R data frame column, we can use the sum function and set the na.rm argument to TRUE. For example, if we have a data frame called df that contains a column x with some missing values, then the sum of the non-missing values can be found by using the command sum(df$x,na.
We can calculate the row-wise sum of multiple columns by using the rowSums() and c() functions: we simply pass the names of the columns.
To find the row sums when NA values exist in an R data frame, we can use the rowSums function and set the na.rm argument to TRUE; this removes the NA values before the row sums are calculated.
This is standard R behaviour, nothing really to do with data.table.
Adding anything to NA will return NA:
NA + 1 ## NA
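sum, on the other hand, will return a single number (the total of everything passed to it), and := then recycles that one value down the whole column.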
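To make the point about sum concrete, here is a minimal sketch on a hypothetical toy table (dt, a, b and s are made-up names, not the data from the question). It assumes the data.table package is installed.

```r
library(data.table)

# Hypothetical toy table, not the data from the question
dt <- data.table(a = c(NA, 1, 2), b = c(10, 20, 30))

# sum() pools all of its arguments into one grand total
total <- sum(dt$a, dt$b, na.rm = TRUE)
length(total)  # a single number, not one value per row

# := recycles that single value into every row of the new column
dt[, s := sum(a, b, na.rm = TRUE)]
print(dt)
```

Every row of s ends up holding the same grand total, which is the pattern seen in the question.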
If you want 1 + NA to return 1, then you will have to run something like
mat[, col3 := col1 + col2]
mat[is.na(col1), col3 := col2]
mat[is.na(col2), col3 := col1]
to deal with the cases where col1 or col2 is NA.
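As a rough illustration of how those three assignments behave, here is a sketch on a hypothetical toy table (dt is a made-up name) that also includes a row where both columns are missing. It assumes the data.table package is installed.

```r
library(data.table)

# Hypothetical toy table; the third row is missing in both columns
dt <- data.table(col1 = c(NA, 1, NA), col2 = c(2, NA, NA))

dt[, col3 := col1 + col2]       # NA wherever either input is NA
dt[is.na(col1), col3 := col2]   # fall back to col2 when col1 is missing
dt[is.na(col2), col3 := col1]   # fall back to col1 when col2 is missing
print(dt)
```

With this approach a row that is NA in both columns stays NA in col3, which may or may not be what you want.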
You could also use rowSums, which has an na.rm argument:
mat[, col3 := rowSums(.SD, na.rm = TRUE), .SDcols = c("col1", "col2")]
rowSums is what you want: by definition, it gives the row sums of a matrix containing col1 and col2, removing NA values. (@JoshuaUlrich suggested this in a comment.)
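For completeness, here is a self-contained sketch of the rowSums approach on the data from the question. Building the table with data.table() rather than structure() is my own choice for the example, and it assumes the data.table package is installed.

```r
library(data.table)

# Same values as in the question, built with data.table() for simplicity
mat <- data.table(
  col1 = c(NA, 0, -0.015038, 0.003817, -0.011407),
  col2 = c(0.003745, 0.007463, -0.007407, -0.003731, -0.007491)
)

# Row-wise sum over the two columns, dropping NA values within each row
mat[, col3 := rowSums(.SD, na.rm = TRUE), .SDcols = c("col1", "col2")]
print(mat)

# A row in which every value is NA sums to 0 once the NAs are dropped
all_na_row <- rowSums(cbind(NA_real_, NA_real_), na.rm = TRUE)
```

Note the last line: with na.rm = TRUE, a row that is entirely NA comes back as 0 rather than NA, so check whether that suits your data.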