Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How can I make this R matrix filling function faster?

Tags:

r

matrix

A while back I wrote a function to fill time series matrices that had NA values up according to the specifications needed and it's had occational uses on a few matrices that are about 50000 rows, 350 columns. The matrix can contain either numeric or character values. The main problem is fixing the matrix is slow and I thought I'd gauge some experts on how to do this faster.

I guess going to rcpp or paralleling it might help but I think it's might be my design rather than R itself that's inefficient. I generally vecotrize everything in R but since the missing values follow no pattern I've found no other way than to work with the matrix on a per row basis.

The function needs to be called so it can carry forwards missing values and also be called to quickly just fill the latest values with the last known one.

Here is an example matrix:

testMatrix <- structure(c(NA, NA, NA, 29.98, 66.89, NA, -12.78, -11.65, NA, 
 4.03, NA, NA, NA, 29.98, 66.89, NA, -12.78, -11.65, NA, NA, NA, 
 NA, NA, 29.98, 66.89, NA, -12.78, NA, NA, 4.76, NA, NA, NA, NA, 
 66.89, NA, -12.78, NA, NA, 4.76, NA, NA, NA, 29.98, 66.89, NA, 
 -12.78, NA, NA, 4.76, NA, NA, NA, 29.98, 66.89, NA, -12.78, NA, 
 NA, 4.39, NA, NA, NA, 29.98, 66.89, NA, -10.72, -11.65, NA, 4.39, 
 NA, NA, NA, 29.98, 50.65, NA, -10.72, -11.65, NA, 4.39, NA, NA, 
 4.72, NA, 50.65, NA, -10.72, -38.61, 45.3, NA), .Dim = c(10L, 
 9L), .Dimnames = list(c("ID_a", "ID_b", "ID_c", "ID_d", "ID_e", 
 "ID_f", "ID_g", "ID_h", "ID_i", "ID_j"), c("2010-09-30", "2010-10-31", 
 "2010-11-30", "2010-12-31", "2011-01-31", "2011-02-28", "2011-03-31", 
 "2011-04-30", "2011-05-31")))

print(testMatrix)
     2010-09-30 2010-10-31 2010-11-30 2010-12-31 2011-01-31 2011-02-28 2011-03-31 2011-04-30 2011-05-31
ID_a         NA         NA         NA         NA         NA         NA         NA         NA         NA
ID_b         NA         NA         NA         NA         NA         NA         NA         NA         NA
ID_c         NA         NA         NA         NA         NA         NA         NA         NA       4.72
ID_d      29.98      29.98      29.98         NA      29.98      29.98      29.98      29.98         NA
ID_e      66.89      66.89      66.89      66.89      66.89      66.89      66.89      50.65      50.65
ID_f         NA         NA         NA         NA         NA         NA         NA         NA         NA
ID_g     -12.78     -12.78     -12.78     -12.78     -12.78     -12.78     -10.72     -10.72     -10.72
ID_h     -11.65     -11.65         NA         NA         NA         NA     -11.65     -11.65     -38.61
ID_i         NA         NA         NA         NA         NA         NA         NA         NA      45.30
ID_j       4.03         NA       4.76       4.76       4.76       4.39       4.39       4.39         NA

This is the function I currently use:

# ----------------------------------------------------------------------------
# GetMatrixWithBlanksFilled
# ----------------------------------------------------------------------------
#
# Arguments:
# inputMatrix --- A matrix with gaps in the time series rows
# fillGapMax  --- The max number of columns to carry a number
#                 forward if there are no more values in the
#                 time series row.
#
# Returns:
# A matrix with gaps filled.

GetMatrixWithBlanksFilled <- function(inputMatrix, fillGapMax = 6, forwardLooking = TRUE) {

    if("DEBUG_ON" %in% ls(globalenv())){browser()}

    cntRow <- nrow(inputMatrix)
    cntCol <- ncol(inputMatrix)

    # 
    if (forwardLooking) {
        for (i in 1:cntRow) {
            # Store the location of the first non NA element in the row
            firstValueCol <- (1:cntCol)[!is.na(inputMatrix[i,])][1]
            if (!(is.na(firstValueCol))) {
                if (!(firstValueCol == cntCol)) {
                    nextValueCol <- firstValueCol
                    # If there is a a value number in the row and it's not at the end of the time
                    # series, start iterating through the row while there are more NA values and
                    # more data values and not at the end of the row continue.
                    while ((sum(as.numeric(is.na(inputMatrix[i,nextValueCol:cntCol]))))>0 && (sum(as.numeric(!is.na(inputMatrix[i,nextValueCol:cntCol]))))>0 && !(nextValueCol == cntCol)) {
                        # Find the next NA element
                        nextNaCol <- (nextValueCol:cntCol)[is.na(inputMatrix[i,nextValueCol:cntCol])][1]
                        # Find the next value element
                        nextValueCol <- (nextNaCol:cntCol)[!is.na(inputMatrix[i,nextNaCol:cntCol])][1]
                        # If there is another value element then fill up all NA elements in between with the last known value
                        if (!is.na(nextValueCol)) {
                            inputMatrix[i,nextNaCol:(nextValueCol-1)] <- inputMatrix[i,(nextNaCol-1)]
                        } else {
                            # If there is no other value element then fill up all NA elements up to the max number supplied
                            # with the last known value unless it's close to the end of the row then just fill up to the end.
                            inputMatrix[i,nextNaCol:min(nextNaCol+fillGapMax,cntCol)] <- inputMatrix[i,(nextNaCol-1)]
                            nextValueCol <- cntCol
                        }
                    }
                }
            }
        }
    } else {
        for (i in 1:cntRow) {
            if (is.na(inputMatrix[i,ncol(inputMatrix)])) {
                tempRow <- inputMatrix[i,max(1,length(inputMatrix[i,])-fillGapMax):length(inputMatrix[i,])]
                if (length(tempRow[!is.na(tempRow)])>0) {
                    lastNonNaLocation <- (length(tempRow):1)[!is.na(tempRow)][length(tempRow[!is.na(tempRow)])]
                    inputMatrix[i,(ncol(inputMatrix)-lastNonNaLocation+2):ncol(inputMatrix)] <- tempRow[!is.na(tempRow)][length(tempRow[!is.na(tempRow)])]
                }
            }
        }
    }

    return(inputMatrix)
}

I'm then calling it with something like:

> fixedMatrix1 <- GetMatrixWithBlanksFilled(testMatrix,fillGapMax=12,forwardLooking=TRUE)
> print(fixedMatrix1)
     2010-09-30 2010-10-31 2010-11-30 2010-12-31 2011-01-31 2011-02-28 2011-03-31 2011-04-30 2011-05-31
ID_a         NA         NA         NA         NA         NA         NA         NA         NA         NA
ID_b         NA         NA         NA         NA         NA         NA         NA         NA         NA
ID_c         NA         NA         NA         NA         NA         NA         NA         NA       4.72
ID_d      29.98      29.98      29.98      29.98      29.98      29.98      29.98      29.98      29.98
ID_e      66.89      66.89      66.89      66.89      66.89      66.89      66.89      50.65      50.65
ID_f         NA         NA         NA         NA         NA         NA         NA         NA         NA
ID_g     -12.78     -12.78     -12.78     -12.78     -12.78     -12.78     -10.72     -10.72     -10.72
ID_h     -11.65     -11.65     -11.65     -11.65     -11.65     -11.65     -11.65     -11.65     -38.61
ID_i         NA         NA         NA         NA         NA         NA         NA         NA      45.30
ID_j       4.03       4.03       4.76       4.76       4.76       4.39       4.39       4.39       4.39

or

> fixedMatrix2 <- GetMatrixWithBlanksFilled(testMatrix,fillGapMax=1,forwardLooking=FALSE)
> print(fixedMatrix2)
     2010-09-30 2010-10-31 2010-11-30 2010-12-31 2011-01-31 2011-02-28 2011-03-31 2011-04-30 2011-05-31
ID_a         NA         NA         NA         NA         NA         NA         NA         NA         NA
ID_b         NA         NA         NA         NA         NA         NA         NA         NA         NA
ID_c         NA         NA         NA         NA         NA         NA         NA         NA       4.72
ID_d      29.98      29.98      29.98         NA      29.98      29.98      29.98      29.98      29.98
ID_e      66.89      66.89      66.89      66.89      66.89      66.89      66.89      50.65      50.65
ID_f         NA         NA         NA         NA         NA         NA         NA         NA         NA
ID_g     -12.78     -12.78     -12.78     -12.78     -12.78     -12.78     -10.72     -10.72     -10.72
ID_h     -11.65     -11.65         NA         NA         NA         NA     -11.65     -11.65     -38.61
ID_i         NA         NA         NA         NA         NA         NA         NA         NA      45.30
ID_j       4.03         NA       4.76       4.76       4.76       4.39       4.39       4.39       4.39

This example runs quickly but is there any way to make it faster for large matrices?

> n <- 38
> m <- 5000
> bigM <- matrix(rep(testMatrix,n*m),m*nrow(testMatrix),n*ncol(testMatrix),FALSE)
> system.time(output <- GetMatrixWithBlanksFilled(bigM,fillGapMax=12,forwardLooking=TRUE))
   user  system elapsed 
  86.47    0.06   87.24

This dummy one has a lot of NA only rows and completely filled ones but the normal ones can take about 15-20 min.

UPDATE

Regarding Charles' comment about na.locf not completely mirroring the logic of the above: Below is a simplified version of how the final function is excluding checks for inputs etc:

FillGaps <- function( dataMatrix, fillGapMax ) {

    require("zoo")

    numRow <- nrow(dataMatrix) 
    numCol <- ncol(dataMatrix) 

    iteration <- (numCol-fillGapMax)

    if(length(iteration)>0) {
        for (i in iteration:1) {
            tempMatrix <- dataMatrix[,i:(i+fillGapMax),drop=FALSE]
            tempMatrix <- t(zoo::na.locf(t(tempMatrix), na.rm=FALSE, maxgap=fillGapMax))
            dataMatrix[,i:(i+fillGapMax)] <- tempMatrix
        }
    }

    return(dataMatrix)
}
like image 929
Hansi Avatar asked Jun 16 '11 10:06

Hansi


People also ask

Is matrix faster than Dataframe in R?

Something not mentioned by @Michal is that not only is a matrix smaller than the equivalent data frame, using matrices can make your code far more efficient than using data frames, often considerably so. That is one reason why internally, a lot of R functions will coerce to matrices data that are in data frames.

What is the limit to matrix in R?

The theoretical limit of a vector in R is 2147483647 elements. So that's about 1 billion rows / 2 columns.

How do you fill a row wise matrix?

We can fill a matrix "row-wise" by: Setting the upper left element to 1 and incrementing successive elements of the first row by 1 from left to right. When a row is filled, move to the next row and repeat, starting with the last element value plus 1.

How do you solve a matrix in R?

The solve() Function in R R programming has a dedicated function to do the task for you. The solve() function in R programming takes a matrix as an argument and then returns the inverse of that matrix.


1 Answers

I might be wrong but I think this is implemented in the zoo package: use the na.locf function.

With your given example matrix, first we should transpose it, and after calling the na function we 'retranspose' the result matrix. E.g.:

> t(na.locf(t(testMatrix), na.rm=FALSE, maxgap=12))
     2010-09-30 2010-10-31 2010-11-30 2010-12-31 2011-01-31 2011-02-28 2011-03-31 2011-04-30 2011-05-31
ID_a         NA         NA         NA         NA         NA         NA         NA         NA         NA
ID_b         NA         NA         NA         NA         NA         NA         NA         NA         NA
ID_c         NA         NA         NA         NA         NA         NA         NA         NA       4.72
ID_d      29.98      29.98      29.98      29.98      29.98      29.98      29.98      29.98      29.98
ID_e      66.89      66.89      66.89      66.89      66.89      66.89      66.89      50.65      50.65
ID_f         NA         NA         NA         NA         NA         NA         NA         NA         NA
ID_g     -12.78     -12.78     -12.78     -12.78     -12.78     -12.78     -10.72     -10.72     -10.72
ID_h     -11.65     -11.65     -11.65     -11.65     -11.65     -11.65     -11.65     -11.65     -38.61
ID_i         NA         NA         NA         NA         NA         NA         NA         NA      45.30
ID_j       4.03       4.03       4.76       4.76       4.76       4.39       4.39       4.39       4.39

And with small maxgap:

> t(na.locf(t(testMatrix), na.rm=FALSE, maxgap=0))
     2010-09-30 2010-10-31 2010-11-30 2010-12-31 2011-01-31 2011-02-28 2011-03-31 2011-04-30 2011-05-31
ID_a         NA         NA         NA         NA         NA         NA         NA         NA         NA
ID_b         NA         NA         NA         NA         NA         NA         NA         NA         NA
ID_c         NA         NA         NA         NA         NA         NA         NA         NA       4.72
ID_d      29.98      29.98      29.98         NA      29.98      29.98      29.98      29.98         NA
ID_e      66.89      66.89      66.89      66.89      66.89      66.89      66.89      50.65      50.65
ID_f         NA         NA         NA         NA         NA         NA         NA         NA         NA
ID_g     -12.78     -12.78     -12.78     -12.78     -12.78     -12.78     -10.72     -10.72     -10.72
ID_h     -11.65     -11.65         NA         NA         NA         NA     -11.65     -11.65     -38.61
ID_i         NA         NA         NA         NA         NA         NA         NA         NA      45.30
ID_j       4.03         NA       4.76       4.76       4.76       4.39       4.39       4.39         NA

The performance gained using na.locf could be seen:

>  system.time(output <- GetMatrixWithBlanksFilled(bigM,fillGapMax=12,forwardLooking=TRUE))
   user  system elapsed 
 79.238   0.540  80.398 
> system.time(output <- t(na.locf(t(bigM), na.rm=FALSE, maxgap=12)))
   user  system elapsed 
 17.129   0.267  17.513 
like image 174
daroczig Avatar answered Oct 04 '22 22:10

daroczig