Interpolate NA values in a data frame with na.approx

Tags: dataframe, r, interpolation

I am trying to remove NAs from my data frame by interpolation with na.approx() but can't remove all of the NAs.

My data frame is 4096x4096, with 270.15 as the flag for invalid values. I need the data to be continuous at all points to feed a meteorological model. Yesterday I asked, and got an answer, on how to replace values in a data frame based on another data frame. After that I came across na.approx(), so I decided to replace the 270.15 values with NA and use na.approx() to interpolate the data. My question is why na.approx() does not replace all of the NAs.

This is what I am doing:

Read the original hdf file with hdf5load

Subset the data frame (4094x4096)

Substitute flag value with NA
```
> sst4[sst4 == 270.15 ] = NA
```

Check first column (or any other)
```
> summary(sst4[,1])
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
  271.3   276.4   285.9   285.5   292.3   302.8  1345.0
```

Run na.approx
```
> sst4=na.approx(sst4,na.rm="FALSE")
```

Check first column
```
> summary(sst4[,1])
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
  271.3   276.5   286.3   285.9   292.6   302.8   411.0
```

As you can see, 411 NAs have not been removed. Why? Do they all correspond to leading/trailing column values?
```
> head(sst4[,1])
[1] NA NA NA NA NA NA
> tail(sst4[,1])
[1] NA NA NA NA NA NA
```

Does na.approx need valid values before and after an NA in order to interpolate? Do I need to set any other na.approx option?

Thank you very much

asked Sep 06 '11 09:09

pacomet

3 Answers

A small, reproducible example:

```
library(zoo)
set.seed(1)
m <- matrix(runif(16, 0, 100), nrow = 4)
missing_values <- sample(16, 7)
m[missing_values] <- NA
m
         [,1]     [,2]      [,3]     [,4]
[1,] 26.55087 20.16819 62.911404 68.70228
[2,] 37.21239       NA  6.178627 38.41037
[3,]       NA       NA        NA       NA
[4,] 90.82078 66.07978        NA       NA

na.approx(m)
         [,1]     [,2]      [,3]     [,4]
[1,] 26.55087 20.16819 62.911404 68.70228
[2,] 37.21239 35.47206  6.178627 38.41037
[3,] 64.01658 50.77592        NA       NA
[4,] 90.82078 66.07978        NA       NA

m[4, 4] <- 50
na.approx(m)
         [,1]     [,2]      [,3]     [,4]
[1,] 26.55087 20.16819 62.911404 68.70228
[2,] 37.21239 35.47206  6.178627 38.41037
[3,] 64.01658 50.77592        NA 44.20519
[4,] 90.82078 66.07978        NA 50.00000
```

Yup, looks like you do need the start/end values of columns to be known or the interpolation doesn't work. Can you guess values for your boundaries?
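One way to confirm that the leftover NAs are all at the column ends is to count the NAs that sit between the first and last known value of a column. A minimal sketch in base R; `interior_na` is a made-up helper name and the vectors are toy data, not the asker's values:

```r
# Count NAs that lie between the first and last known value of x.
# Hypothetical helper, base R only.
interior_na <- function(x) {
  known <- which(!is.na(x))
  if (length(known) == 0L) return(0L)  # nothing known, so no interior
  inside <- x[min(known):max(known)]
  sum(is.na(inside))
}

col_before <- c(NA, NA, 271.3, NA, 276.4, NA)      # made-up values
col_after  <- c(NA, NA, 271.3, 273.85, 276.4, NA)  # interior gap filled

interior_na(col_before)  # one NA sits between known values
interior_na(col_after)   # only edge NAs remain
```

For the real data, something like `sum(apply(sst4, 2, interior_na))` should come out as 0 after `na.approx()` if only edge values are left.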

ANOTHER EDIT: So by default, you need the start and end values of columns to be known. However it is possible to get na.approx to always fill in the blanks by passing rule = 2. See Felix's answer. You can also use na.fill to provide a default value, as per Gabor's comment. Finally, you can interpolate boundary conditions in two directions (see below) or guess boundary conditions.
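Applied to the workflow in the question, the `rule = 2` route might look like the sketch below. The small matrix `sst` and its values are invented for illustration; only the flag value 270.15 comes from the question.

```r
library(zoo)

# Toy stand-in for the question's grid; 270.15 is the invalid-value flag.
sst <- matrix(c(270.15, 280,    270.15, 284,    270.15,
                281,    270.15, 283,    270.15, 270.15),
              nrow = 5)
sst[sst == 270.15] <- NA

# rule = 2 is passed on to approx(): edge NAs take the nearest known value.
# na.rm = FALSE keeps the original number of rows.
filled <- na.approx(sst, rule = 2, na.rm = FALSE)

anyNA(filled)
filled
```

Note that this holds the edge values constant rather than extending the trend, which may or may not be acceptable for a meteorological model.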

EDIT: A further thought. Since na.approx only interpolates along columns, and your data is spatial, interpolating along rows might be useful too. Then you could take the average of the two.

na.approx fails when whole columns are NA, so we create a bigger dataset.

```
set.seed(1)
m <- matrix(runif(64, 0, 100), nrow = 8)
missing_values <- sample(64, 15)
m[missing_values] <- NA
```

Run na.approx both ways.

```
by_col <- na.approx(m)
by_row <- t(na.approx(t(m)))
```

Find out the best guess.

```
default <- 50
best_guess <- ifelse(
  is.na(by_row),
  ifelse(
    is.na(by_col),
    default,              # neither known
    by_col                # only by_col known
  ),
  ifelse(
    is.na(by_col),
    by_row,               # only by_row known
    (by_row + by_col) / 2 # both known
  )
)
```
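The same combination rule can probably be written more compactly by averaging the two estimates with `na.rm = TRUE` and falling back to the default where both are missing. A sketch in base R; the two small matrices here are hand-made stand-ins for `by_col` and `by_row`, not output from `na.approx`:

```r
# Hand-made stand-ins for the two interpolation results (toy values).
by_col <- matrix(c(1, NA, 3, NA), nrow = 2)
by_row <- matrix(c(3, 4, NA, NA), nrow = 2)
default <- 50

# One row per cell, one column per estimate.
stacked <- cbind(c(by_col), c(by_row))

# Mean of whatever is known; rowMeans gives NaN when both are NA.
best_guess <- rowMeans(stacked, na.rm = TRUE)
best_guess[is.nan(best_guess)] <- default

# Restore the matrix shape.
dim(best_guess) <- dim(by_col)
best_guess
```

This gives the average where both are known, the single known value where only one is, and `default` otherwise, which is the same logic as the nested `ifelse`.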

answered Sep 23 '22 20:09

Richie Cotton

na.approx() follows the approx() function in only interpolating values, not extrapolating them, by default. However, as described in the help page for approx(), you can specify rule = 2 to extrapolate as a constant value of the nearest extreme. Following on from Richie Cotton's example:

```
na.approx(m, rule = 2)
         [,1]     [,2]      [,3]     [,4]
[1,] 26.55087 20.16819 62.911404 68.70228
[2,] 37.21239 35.47206  6.178627 38.41037
[3,] 64.01658 50.77592  6.178627 38.41037
[4,] 90.82078 66.07978  6.178627 38.41037
```

Equivalently, you can use "last observation carried forward" explicitly.

```
na.locf(na.approx(m))
## "first observation carried backwards" too:
na.locf(na.locf(na.approx(m)), fromLast = TRUE)
```
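To see the interpolate-versus-extrapolate distinction in isolation, here is a small sketch that calls base R's `approx()` directly on a toy vector (the values are invented):

```r
x <- 1:6
y <- c(NA, 10, NA, 30, NA, NA)  # toy data: known values at x = 2 and x = 4
known <- !is.na(y)

# rule = 1 (the default): points outside the known range come back as NA.
inside_only <- approx(x[known], y[known], xout = x)$y

# rule = 2: points outside the range take the nearest known value.
with_edges <- approx(x[known], y[known], xout = x, rule = 2)$y

inside_only
with_edges
```

With the default rule only position 3 gets filled; with `rule = 2` the ends are held at 10 and 30.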

answered Sep 22 '22 20:09

Felix Andrews

I think you should try to set na.rm=TRUE

From the docs:

na.rm: logical. Should leading NAs be removed?

http://www.oga-lab.net/RGM2/func.php?rd_id=zoo:na.approx
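It may be worth checking what `na.rm` actually does before relying on it. As far as I can tell, `na.rm = TRUE` (the default) drops leading and trailing NAs instead of filling them, so the result gets shorter. A small sketch on a toy vector:

```r
library(zoo)

x <- c(NA, 1, NA, 3, NA)  # toy vector with NAs at both ends

trimmed <- na.approx(x)                 # na.rm = TRUE is the default
kept    <- na.approx(x, na.rm = FALSE)  # same length as x, edge NAs stay

length(trimmed)
kept
```

So for a grid that has to keep its shape, `na.rm = FALSE` combined with one of the edge-filling approaches from the other answers is likely the safer choice.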

answered Sep 21 '22 20:09

Henrik
