A while back I wrote a function to fill time series matrices that had NA values up according to the specifications needed and it's had occational uses on a few matrices that are about 50000 rows, 350 columns. The matrix can contain either numeric or character values. The main problem is fixing the matrix is slow and I thought I'd gauge some experts on how to do this faster.
I guess going to rcpp or paralleling it might help but I think it's might be my design rather than R itself that's inefficient. I generally vecotrize everything in R but since the missing values follow no pattern I've found no other way than to work with the matrix on a per row basis.
The function needs to be called so it can carry forwards missing values and also be called to quickly just fill the latest values with the last known one.
Here is an example matrix:
testMatrix <- structure(c(NA, NA, NA, 29.98, 66.89, NA, -12.78, -11.65, NA,
4.03, NA, NA, NA, 29.98, 66.89, NA, -12.78, -11.65, NA, NA, NA,
NA, NA, 29.98, 66.89, NA, -12.78, NA, NA, 4.76, NA, NA, NA, NA,
66.89, NA, -12.78, NA, NA, 4.76, NA, NA, NA, 29.98, 66.89, NA,
-12.78, NA, NA, 4.76, NA, NA, NA, 29.98, 66.89, NA, -12.78, NA,
NA, 4.39, NA, NA, NA, 29.98, 66.89, NA, -10.72, -11.65, NA, 4.39,
NA, NA, NA, 29.98, 50.65, NA, -10.72, -11.65, NA, 4.39, NA, NA,
4.72, NA, 50.65, NA, -10.72, -38.61, 45.3, NA), .Dim = c(10L,
9L), .Dimnames = list(c("ID_a", "ID_b", "ID_c", "ID_d", "ID_e",
"ID_f", "ID_g", "ID_h", "ID_i", "ID_j"), c("2010-09-30", "2010-10-31",
"2010-11-30", "2010-12-31", "2011-01-31", "2011-02-28", "2011-03-31",
"2011-04-30", "2011-05-31")))
print(testMatrix)
2010-09-30 2010-10-31 2010-11-30 2010-12-31 2011-01-31 2011-02-28 2011-03-31 2011-04-30 2011-05-31
ID_a NA NA NA NA NA NA NA NA NA
ID_b NA NA NA NA NA NA NA NA NA
ID_c NA NA NA NA NA NA NA NA 4.72
ID_d 29.98 29.98 29.98 NA 29.98 29.98 29.98 29.98 NA
ID_e 66.89 66.89 66.89 66.89 66.89 66.89 66.89 50.65 50.65
ID_f NA NA NA NA NA NA NA NA NA
ID_g -12.78 -12.78 -12.78 -12.78 -12.78 -12.78 -10.72 -10.72 -10.72
ID_h -11.65 -11.65 NA NA NA NA -11.65 -11.65 -38.61
ID_i NA NA NA NA NA NA NA NA 45.30
ID_j 4.03 NA 4.76 4.76 4.76 4.39 4.39 4.39 NA
This is the function I currently use:
# ----------------------------------------------------------------------------
# GetMatrixWithBlanksFilled
# ----------------------------------------------------------------------------
#
# Arguments:
# inputMatrix --- A matrix with gaps in the time series rows
# fillGapMax --- The max number of columns to carry a number
# forward if there are no more values in the
# time series row.
#
# Returns:
# A matrix with gaps filled.
GetMatrixWithBlanksFilled <- function(inputMatrix, fillGapMax = 6, forwardLooking = TRUE) {
if("DEBUG_ON" %in% ls(globalenv())){browser()}
cntRow <- nrow(inputMatrix)
cntCol <- ncol(inputMatrix)
#
if (forwardLooking) {
for (i in 1:cntRow) {
# Store the location of the first non NA element in the row
firstValueCol <- (1:cntCol)[!is.na(inputMatrix[i,])][1]
if (!(is.na(firstValueCol))) {
if (!(firstValueCol == cntCol)) {
nextValueCol <- firstValueCol
# If there is a a value number in the row and it's not at the end of the time
# series, start iterating through the row while there are more NA values and
# more data values and not at the end of the row continue.
while ((sum(as.numeric(is.na(inputMatrix[i,nextValueCol:cntCol]))))>0 && (sum(as.numeric(!is.na(inputMatrix[i,nextValueCol:cntCol]))))>0 && !(nextValueCol == cntCol)) {
# Find the next NA element
nextNaCol <- (nextValueCol:cntCol)[is.na(inputMatrix[i,nextValueCol:cntCol])][1]
# Find the next value element
nextValueCol <- (nextNaCol:cntCol)[!is.na(inputMatrix[i,nextNaCol:cntCol])][1]
# If there is another value element then fill up all NA elements in between with the last known value
if (!is.na(nextValueCol)) {
inputMatrix[i,nextNaCol:(nextValueCol-1)] <- inputMatrix[i,(nextNaCol-1)]
} else {
# If there is no other value element then fill up all NA elements up to the max number supplied
# with the last known value unless it's close to the end of the row then just fill up to the end.
inputMatrix[i,nextNaCol:min(nextNaCol+fillGapMax,cntCol)] <- inputMatrix[i,(nextNaCol-1)]
nextValueCol <- cntCol
}
}
}
}
}
} else {
for (i in 1:cntRow) {
if (is.na(inputMatrix[i,ncol(inputMatrix)])) {
tempRow <- inputMatrix[i,max(1,length(inputMatrix[i,])-fillGapMax):length(inputMatrix[i,])]
if (length(tempRow[!is.na(tempRow)])>0) {
lastNonNaLocation <- (length(tempRow):1)[!is.na(tempRow)][length(tempRow[!is.na(tempRow)])]
inputMatrix[i,(ncol(inputMatrix)-lastNonNaLocation+2):ncol(inputMatrix)] <- tempRow[!is.na(tempRow)][length(tempRow[!is.na(tempRow)])]
}
}
}
}
return(inputMatrix)
}
I'm then calling it with something like:
> fixedMatrix1 <- GetMatrixWithBlanksFilled(testMatrix,fillGapMax=12,forwardLooking=TRUE)
> print(fixedMatrix1)
2010-09-30 2010-10-31 2010-11-30 2010-12-31 2011-01-31 2011-02-28 2011-03-31 2011-04-30 2011-05-31
ID_a NA NA NA NA NA NA NA NA NA
ID_b NA NA NA NA NA NA NA NA NA
ID_c NA NA NA NA NA NA NA NA 4.72
ID_d 29.98 29.98 29.98 29.98 29.98 29.98 29.98 29.98 29.98
ID_e 66.89 66.89 66.89 66.89 66.89 66.89 66.89 50.65 50.65
ID_f NA NA NA NA NA NA NA NA NA
ID_g -12.78 -12.78 -12.78 -12.78 -12.78 -12.78 -10.72 -10.72 -10.72
ID_h -11.65 -11.65 -11.65 -11.65 -11.65 -11.65 -11.65 -11.65 -38.61
ID_i NA NA NA NA NA NA NA NA 45.30
ID_j 4.03 4.03 4.76 4.76 4.76 4.39 4.39 4.39 4.39
or
> fixedMatrix2 <- GetMatrixWithBlanksFilled(testMatrix,fillGapMax=1,forwardLooking=FALSE)
> print(fixedMatrix2)
2010-09-30 2010-10-31 2010-11-30 2010-12-31 2011-01-31 2011-02-28 2011-03-31 2011-04-30 2011-05-31
ID_a NA NA NA NA NA NA NA NA NA
ID_b NA NA NA NA NA NA NA NA NA
ID_c NA NA NA NA NA NA NA NA 4.72
ID_d 29.98 29.98 29.98 NA 29.98 29.98 29.98 29.98 29.98
ID_e 66.89 66.89 66.89 66.89 66.89 66.89 66.89 50.65 50.65
ID_f NA NA NA NA NA NA NA NA NA
ID_g -12.78 -12.78 -12.78 -12.78 -12.78 -12.78 -10.72 -10.72 -10.72
ID_h -11.65 -11.65 NA NA NA NA -11.65 -11.65 -38.61
ID_i NA NA NA NA NA NA NA NA 45.30
ID_j 4.03 NA 4.76 4.76 4.76 4.39 4.39 4.39 4.39
This example runs quickly but is there any way to make it faster for large matrices?
> n <- 38
> m <- 5000
> bigM <- matrix(rep(testMatrix,n*m),m*nrow(testMatrix),n*ncol(testMatrix),FALSE)
> system.time(output <- GetMatrixWithBlanksFilled(bigM,fillGapMax=12,forwardLooking=TRUE))
user system elapsed
86.47 0.06 87.24
This dummy one has a lot of NA only rows and completely filled ones but the normal ones can take about 15-20 min.
UPDATE
Regarding Charles' comment about na.locf not completely mirroring the logic of the above: Below is a simplified version of how the final function is excluding checks for inputs etc:
FillGaps <- function( dataMatrix, fillGapMax ) {
require("zoo")
numRow <- nrow(dataMatrix)
numCol <- ncol(dataMatrix)
iteration <- (numCol-fillGapMax)
if(length(iteration)>0) {
for (i in iteration:1) {
tempMatrix <- dataMatrix[,i:(i+fillGapMax),drop=FALSE]
tempMatrix <- t(zoo::na.locf(t(tempMatrix), na.rm=FALSE, maxgap=fillGapMax))
dataMatrix[,i:(i+fillGapMax)] <- tempMatrix
}
}
return(dataMatrix)
}
Something not mentioned by @Michal is that not only is a matrix smaller than the equivalent data frame, using matrices can make your code far more efficient than using data frames, often considerably so. That is one reason why internally, a lot of R functions will coerce to matrices data that are in data frames.
The theoretical limit of a vector in R is 2147483647 elements. So that's about 1 billion rows / 2 columns.
We can fill a matrix "row-wise" by: Setting the upper left element to 1 and incrementing successive elements of the first row by 1 from left to right. When a row is filled, move to the next row and repeat, starting with the last element value plus 1.
The solve() Function in R R programming has a dedicated function to do the task for you. The solve() function in R programming takes a matrix as an argument and then returns the inverse of that matrix.
I might be wrong but I think this is implemented in the zoo package: use the na.locf
function.
With your given example matrix, first we should transpose it, and after calling the na
function we 'retranspose' the result matrix. E.g.:
> t(na.locf(t(testMatrix), na.rm=FALSE, maxgap=12))
2010-09-30 2010-10-31 2010-11-30 2010-12-31 2011-01-31 2011-02-28 2011-03-31 2011-04-30 2011-05-31
ID_a NA NA NA NA NA NA NA NA NA
ID_b NA NA NA NA NA NA NA NA NA
ID_c NA NA NA NA NA NA NA NA 4.72
ID_d 29.98 29.98 29.98 29.98 29.98 29.98 29.98 29.98 29.98
ID_e 66.89 66.89 66.89 66.89 66.89 66.89 66.89 50.65 50.65
ID_f NA NA NA NA NA NA NA NA NA
ID_g -12.78 -12.78 -12.78 -12.78 -12.78 -12.78 -10.72 -10.72 -10.72
ID_h -11.65 -11.65 -11.65 -11.65 -11.65 -11.65 -11.65 -11.65 -38.61
ID_i NA NA NA NA NA NA NA NA 45.30
ID_j 4.03 4.03 4.76 4.76 4.76 4.39 4.39 4.39 4.39
And with small maxgap
:
> t(na.locf(t(testMatrix), na.rm=FALSE, maxgap=0))
2010-09-30 2010-10-31 2010-11-30 2010-12-31 2011-01-31 2011-02-28 2011-03-31 2011-04-30 2011-05-31
ID_a NA NA NA NA NA NA NA NA NA
ID_b NA NA NA NA NA NA NA NA NA
ID_c NA NA NA NA NA NA NA NA 4.72
ID_d 29.98 29.98 29.98 NA 29.98 29.98 29.98 29.98 NA
ID_e 66.89 66.89 66.89 66.89 66.89 66.89 66.89 50.65 50.65
ID_f NA NA NA NA NA NA NA NA NA
ID_g -12.78 -12.78 -12.78 -12.78 -12.78 -12.78 -10.72 -10.72 -10.72
ID_h -11.65 -11.65 NA NA NA NA -11.65 -11.65 -38.61
ID_i NA NA NA NA NA NA NA NA 45.30
ID_j 4.03 NA 4.76 4.76 4.76 4.39 4.39 4.39 NA
The performance gained using na.locf
could be seen:
> system.time(output <- GetMatrixWithBlanksFilled(bigM,fillGapMax=12,forwardLooking=TRUE))
user system elapsed
79.238 0.540 80.398
> system.time(output <- t(na.locf(t(bigM), na.rm=FALSE, maxgap=12)))
user system elapsed
17.129 0.267 17.513
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With