
Interpolate NA values in a data frame with na.approx

I am trying to remove NAs from my data frame by interpolation with na.approx() but can't remove all of the NAs.

My data frame is 4096x4096, with 270.15 as the flag for invalid values. I need the data to be continuous at every point so that I can feed it to a meteorological model. Yesterday I asked, and got an answer, on how to replace values in a data frame based on another data frame. After that I came across na.approx() and decided to replace the 270.15 values with NA and let na.approx() interpolate the data. My question is why na.approx() does not replace all of the NAs.

This is what I am doing:

  • Read the original hdf file with hdf5load
  • Subset the data frame (4094x4096)
  • Substitute flag value with NA

    > sst4[sst4 == 270.15 ] = NA
    
  • Check first column (or any other)

    > summary(sst4[,1])
    
    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
    271.3   276.4   285.9   285.5   292.3   302.8  1345.0
    
  • Run na.approx

    > sst4=na.approx(sst4,na.rm="FALSE")
    
  • Check first column

    > summary(sst4[,1]) 
    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
    271.3   276.5   286.3   285.9   292.6   302.8   411.0
    

As you can see, 411 NAs have not been removed. Why? Do they all correspond to leading/trailing column values?

head(sst4[,1])
[1] NA NA NA NA NA NA
tail(sst4[,1])
[1] NA NA NA NA NA NA

Does na.approx need valid values both before and after an NA in order to interpolate? Do I need to set any other na.approx option?

Thank you very much

pacomet asked Sep 06 '11 09:09


3 Answers

A small, reproducible example:

library(zoo)
set.seed(1)
m <- matrix(runif(16, 0, 100), nrow = 4)
missing_values <- sample(16, 7)
m[missing_values] <- NA
m
         [,1]     [,2]      [,3]     [,4]
[1,] 26.55087 20.16819 62.911404 68.70228
[2,] 37.21239       NA  6.178627 38.41037
[3,]       NA       NA        NA       NA
[4,] 90.82078 66.07978        NA       NA

na.approx(m)
         [,1]     [,2]      [,3]     [,4]
[1,] 26.55087 20.16819 62.911404 68.70228
[2,] 37.21239 35.47206  6.178627 38.41037
[3,] 64.01658 50.77592        NA       NA
[4,] 90.82078 66.07978        NA       NA

m[4, 4] <- 50
na.approx(m)
         [,1]     [,2]      [,3]     [,4]
[1,] 26.55087 20.16819 62.911404 68.70228
[2,] 37.21239 35.47206  6.178627 38.41037
[3,] 64.01658 50.77592        NA 44.20519
[4,] 90.82078 66.07978        NA 50.00000

Yup, looks like you do need the start/end values of columns to be known or the interpolation doesn't work. Can you guess values for your boundaries?

ANOTHER EDIT: So by default, you need the start and end values of columns to be known. However it is possible to get na.approx to always fill in the blanks by passing rule = 2. See Felix's answer. You can also use na.fill to provide a default value, as per Gabor's comment. Finally, you can interpolate boundary conditions in two directions (see below) or guess boundary conditions.
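
As a rough sketch of the na.fill option mentioned above (the toy vector x is made up for illustration, and this assumes the zoo package is installed), a single fill value replaces every NA, while the keyword "extend" interpolates interior gaps and repeats the nearest known value at the edges:

```r
library(zoo)

# Toy vector with a leading, an interior and a trailing NA (illustrative data only).
x <- c(NA, 1, NA, 3, NA)

# A single constant: every NA, wherever it sits, is replaced by 0.
filled_const <- na.fill(x, 0)

# "extend": interior NAs are linearly interpolated, and leading/trailing
# NAs take the nearest non-NA value.
filled_ext <- na.fill(x, "extend")

print(filled_const)
print(filled_ext)
```

Whether a constant or an extended edge value makes sense depends on the data; for a temperature field, a constant such as 0 is unlikely to be physically meaningful.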


EDIT: A further thought. Since na.approx only interpolates within columns, and your data is spatial, perhaps interpolating along rows would be useful too. Then you could take the average.

na.approx fails when whole columns are NA, so we create a bigger dataset.

set.seed(1)
m <- matrix(runif(64, 0, 100), nrow = 8)
missing_values <- sample(64, 15)
m[missing_values] <- NA

Run na.approx both ways.

by_col <- na.approx(m)
by_row <- t(na.approx(t(m)))

Find out the best guess.

default <- 50
best_guess <- ifelse(is.na(by_row),
  ifelse(
    is.na(by_col),
    default,              #neither known
    by_col                #only by_col known
  ),
  ifelse(
    is.na(by_col),
    by_row,               #only by_row known
    (by_row + by_col) / 2 #both known
  )
)

Richie Cotton answered Sep 23 '22 20:09


na.approx() follows the approx() function in only interpolating values, not extrapolating them, by default. However, as described in the help page for approx(), you can specify rule = 2 to extrapolate as a constant value of the nearest extreme. Following on from Richie Cotton's example:

na.approx(m, rule = 2)
         [,1]     [,2]      [,3]     [,4]
[1,] 26.55087 20.16819 62.911404 68.70228
[2,] 37.21239 35.47206  6.178627 38.41037
[3,] 64.01658 50.77592  6.178627 38.41037
[4,] 90.82078 66.07978  6.178627 38.41037

Equivalently, you can use "last observation carry forward" explicitly.

na.locf(na.approx(m))
## "first observation carry backwards" too:
na.locf(na.locf(na.approx(m)), fromLast = TRUE)
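
To see what rule does underneath, here is a minimal base-R sketch that calls approx() directly on a made-up vector (the names y, x and ok are only for illustration). It is a simplified picture of the idea and not zoo's actual implementation:

```r
# Toy series with NAs at both ends and in the middle (illustrative values).
y <- c(NA, NA, 10, NA, 30, NA)
x <- seq_along(y)
ok <- !is.na(y)

# rule = 1 (the default): positions outside the observed range stay NA.
interp_only <- approx(x[ok], y[ok], xout = x, rule = 1)$y

# rule = 2: positions outside the range take the nearest observed value.
interp_extend <- approx(x[ok], y[ok], xout = x, rule = 2)$y

print(interp_only)
print(interp_extend)
```

Only the interior gap is filled in both cases; the two results differ solely at the leading and trailing positions.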

Felix Andrews answered Sep 22 '22 20:09


I think you should try to set na.rm=TRUE

From the docs

na.rm logical. Should leading NAs be removed?

http://www.oga-lab.net/RGM2/func.php?rd_id=zoo:na.approx
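
For reference, a small sketch of what na.rm changes on a plain vector (the vector x is invented for illustration, and zoo is assumed to be installed). As far as I can tell, na.rm = TRUE drops the leading and trailing NAs instead of filling them, so the result gets shorter:

```r
library(zoo)

# Toy vector with a leading, an interior and a trailing NA (illustrative data only).
x <- c(NA, 1, NA, 3, NA)

# Default (na.rm = TRUE): edge NAs are trimmed away, interior NA is interpolated.
trimmed <- na.approx(x)

# na.rm = FALSE: same interpolation, but the edge NAs are kept in place.
kept <- na.approx(x, na.rm = FALSE)

print(trimmed)
print(kept)
```

Either way the edge NAs are not interpolated, so for a grid that must keep its dimensions, na.rm = TRUE on its own is unlikely to be enough.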

Henrik answered Sep 21 '22 20:09