adding successive four / n numbers in large matrix in R

Tags:

I have very large dataset with dimension of 60K x 4 K. I am trying add every four values in succession in every row column wise. The following is smaller example dataset.

    set.seed(123)
    mat <- matrix (sample(0:1, 48, replace = TRUE), 4)

   [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
[1,]    0    1    1    1    0    1    1    0    1     1     0     0
[2,]    1    0    0    1    0    1    1    0    1     0     0     0
[3,]    0    1    1    0    0    1    1    1    0     0     0     0
[4,]    1    1    0    1    1    1    1    1    0     0     0     0

Here is what I am trying to perform:

mat[1,1] + mat[1,2] + mat[1,3] + mat[1,4] = 0 + 1 + 1 + 1 = 3

i.e. add every four values and output.

mat[1,5] + mat[1,6] + mat[1,7] + mat[1,8] = 0 + 1 + 1 + 0 = 2

Keep going to end of matrix (here to 12).

mat[1,9] + mat[1,10] + mat[1,11] + mat[1,12]

Once first row is done apply the same to second row, like:

mat[2,1] + mat[2,2] + mat[2,3] + mat[2,4] 
mat[2,5] + mat[2,6] + mat[2,7] + mat[2,8]
mat[2,9] + mat[2,10] + mat[2,11] + mat[2,12]

The result will be nrow x (ncol)/4 matrix.

The expected result will look like:

          col1-col4      col5-8   col9-12
row1        3              2        2
row2        2              2        1
row3        2              3        0
row4        3              4        0

Similarly for row 3 to number of rows in the matrix. How can I efficiently loop this.

831

asked Aug 27 '14 18:08

SHRram

1 Answers

While Matthew's answer is really cool (+1, btw), you can get a much (~100x) faster solution if you avoid apply and use the *Sums functions (in this case colSums), and a bit of vector manipulation trickery:

funSums <- function(mat) {
  t.mat <- t(mat)                                    # rows become columns
  dim(t.mat) <- c(4, length(t.mat) / 4)              # wrap columns every four items (this is what we want to sum)
  t(matrix(colSums(t.mat), nrow=ncol(mat) / 4))      # sum our new 4 element columns, and reconstruct desired output format
}
set.seed(123)
mat <- matrix(sample(0:1, 48, replace = TRUE), 4)
funSums(mat)

Produces desired output:

     [,1] [,2] [,3]
[1,]    3    2    2
[2,]    2    2    1
[3,]    2    3    0
[4,]    3    4    0

Now, let's make something the real size and compare against the other options:

set.seed(123)
mat <- matrix(sample(0:1, 6e5, replace = TRUE), 4)

funApply <- function(mat) {   # Matthew's Solution
  apply(array(mat, dim=c(4, 4, ncol(mat) / 4)), MARGIN=c(1,3), FUN=sum)
}
funRcpp <- function(mat) {    # David's Solution
  roll_sum(mat, 4, by.column = F)[, seq_len(ncol(mat) - 4 + 1)%%4 == 1]
}
library(microbenchmark)
microbenchmark(times=10,
  funSums(mat),
  funApply(mat),
  funRcpp(mat)
)

Produces:

Unit: milliseconds
          expr        min         lq     median       uq       max neval
  funSums(mat)   4.035823   4.079707   5.256517   7.5359  42.06529    10
 funApply(mat) 379.124825 399.060015 430.899162 455.7755 471.35960    10
  funRcpp(mat)  18.481184  20.364885  38.595383 106.0277 132.93382    10

And to check:

all.equal(funSums(mat), funApply(mat))
# [1] TRUE
all.equal(funSums(mat), funRcpp(mat))
# [1] TRUE

The key point is that the *Sums functions are fully "vectorized", in as much as all the calculations happen in C. apply still needs to do a bunch of not strictly vectorized (in the primitive C function way) stuff in R, and is slower (but far more flexible).

Specific to this problem, it might be possible to make it 2-3x faster as about half the time is spent on the transpositions, which are only necessary so that the dim changes do what I need for colSums to work.

196

answered Oct 09 '22 00:10

BrodieG

Related questions
                            
                                Count consecutive TRUE values within each block separately [duplicate]
                            
                                merge data frame and named vector
                            
                                How to sort rows of a data frame based on a vector using dplyr pipe
                            
                                Querying Oracle DB from Revolution R using RODBC
                            
                                Deleting specific rows from a data frame
                            
                                Paste together two character vectors of different lengths
                            
                                k-means: Same clusters for every execution
                            
                                Testing if rows of a matrix or data frame are sorted in R
                            
                                Import stuff from a R file
                            
                                R: find nearest index
                            
                                Fitting logarithmic curve in R
                            
                                Read.CSV not working as expected in R
                            
                                Generate combination of data frame and vector
                            
                                sapply with custom function (series of if statements)
                            
                                Convertic non-numeric factor to numeric column with mapping in R
                            
                                How can i rescale every column in my data frame to a 0-100 scale? (in r)
                            
                                Print number as word if less than 10
                            
                                Convert numbers to letters
                            
                                IPython notebook and rmagic/rpy2: cannot find module ri2py (OSX 10.8.5, python 2.7, R 3.1)
                            
                                read.sas7bdat unable to read compressed file

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

adding successive four / n numbers in large matrix in R

Tags:

loops

r

large-data

SHRram

People also ask

1 Answers

BrodieG

Recent Activity

Donate For Us