Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

R: Interpolation of NAs by group

I would like to perform a linear interpolation in a variable of a data frame which takes into account the: 1) time difference between the two points, 2) the moment when the data was taken and 3) the individual taken for measure the variable.

For example in the next dataframe:

 df <- data.frame(time=c(1,2,3,4,5,6,7,1,2,3),
            Individuals=c(1,1,1,1,1,1,1,2,2,2),
            Value=c(1, 2, 3, NA, 5, NA, 7, 5, NA, 7))
  df

I would like to obtain:

 result <- data.frame(time=c(1,2,3,4,5,6,7,1,2,3),
                Individuals=c(1,1,1,1,1,1,1,2,2,2),
                Value=c(1, 2, 3, 4, 5, 6, 7, 5, 5.5, 6))
 result

I cannot use exclusively the function na.approx of the package zoo because all observations are not consecutives, some observations belong to one individual and other observations belong to other ones. The reason is because if the second individual would have its first obsrevation with NA and I would use exclusively the function na.approx, I would be using information from the individual==1 to interpolate the NA of the individual==2 (e.g the next data frame would have sucherror)

  df_2 <- data.frame(time=c(1,2,3,4,5,6,7,1,2,3),
                Individuals=c(1,1,1,1,1,1,1,2,2,2),
                Value=c(1, 2, 3, NA, 5, NA, 7, NA, 5, 7))
  df_2

I have tried using the packages zoo and dplyr:

library(dplyr)
library(zoo)
proof <- df %>%
  group_by(Individuals) %>%
  na.approx(df$Value)

But I cannot perform group_by in a zoo object.

Do you know how to interpolate NA values in one variable by groups?

Thanks in advance,

like image 705
Ruben Avatar asked Nov 13 '15 15:11

Ruben


People also ask

What happened to Na_interpolation in R?

na.interpolation is replaced by na_interpolation . The functionality stays the same. The new name better fits modern R code style guidelines (which prefer _ over . in function names). na.interpolation (x, option = "linear", maxgap = Inf, ...) Numeric Vector ( vector) or Time Series ( ts ) object in which missing values shall be replaced

Why do your geospatial interpolation in R?

This is especially true in the geosciences, where sampling might require hundreds of thousands of Euros for drilling deep holes to the formation of your interest. Doing your geospatial interpolation in R is a great opportunity.

What is the'linear'option in the interpolation settings?

"linear" - for linear interpolation using approx (default choice) Maximum number of successive NAs to still perform imputation on. Default setting is to replace all NAs without restrictions. With this option set, consecutive NAs runs, that are longer than 'maxgap' will be left NA.

How to replace consecutive NAS that are longer than'maxgap'?

Default setting is to replace all NAs without restrictions. With this option set, consecutive NAs runs, that are longer than 'maxgap' will be left NA. This option mostly makes sense if you want to treat long runs of NA afterwards separately. ... Vector ( vector) or Time Series ( ts ) object (dependent on given input at parameter x)


2 Answers

Use data.frame, rather than cbind to create your data. cbind returns a matrix, but you need a data frame for dplyr. Then use na.approx inside mutate. I've commented out group_by, as you haven't provided the grouping variable in your data, but the approach should work once you've added the grouping variable to the data frame.

df <- data.frame(time=c(1,2,3,4,5,6,7,1,2,3),
            Individuals=c(1,1,1,1,1,1,1,2,2,2),
            Value=c(NA, 2, 3, NA, 5, NA, 7, 8, NA, 10))

library(dplyr)
library(zoo)

df %>%
  group_by(Individuals) %>%
  mutate(ValueInterp = na.approx(Value, na.rm=FALSE))    
   time Individuals Value ValueInterp
1     1           1    NA          NA
2     2           1     2           2
3     3           1     3           3
4     4           1    NA           4
5     5           1     5           5
6     6           1    NA           6
7     7           1     7           7
8     1           2     8           8
9     2           2    NA           9
10    3           2    10          10

Update: To interpolate multiple columns, we can use mutate_at. Here's an example with two value columns. We use mutate_at to run na.approx on all columns that include "Value" in the column name. list(interp=na.approx) tells mutate_at to generate new column names by running na.approx and adding interp as a suffix to generate the new column names:

df <- data.frame(time=c(1,2,3,4,5,6,7,1,2,3),
                 Individuals=c(1,1,1,1,1,1,1,2,2,2),
                 Value1=c(NA, 2, 3, NA, 5, NA, 7, 8, NA, 10),
                 Value2=c(NA, 2, 3, NA, 5, NA, 7, 8, NA, 10)*2)

df %>%
  group_by(Individuals) %>%
  mutate_at(vars(matches("Value")), list(interp=na.approx), na.rm=FALSE)
    time Individuals Value1 Value2 Value1_interp Value2_interp
   <dbl>       <dbl>  <dbl>  <dbl>         <dbl>         <dbl>
 1     1           1     NA     NA            NA            NA
 2     2           1      2      4             2             4
 3     3           1      3      6             3             6
 4     4           1     NA     NA             4             8
 5     5           1      5     10             5            10
 6     6           1     NA     NA             6            12
 7     7           1      7     14             7            14
 8     1           2      8     16             8            16
 9     2           2     NA     NA             9            18
10     3           2     10     20            10            20

If you don't want to preserve the original, uninterpolated columns, you can do:

df %>%
  group_by(Individuals) %>%
  mutate_at(vars(matches("Value")), na.approx, na.rm=FALSE)
like image 65
eipi10 Avatar answered Oct 23 '22 03:10

eipi10


We can use data.table

library(data.table)
library(zoo)
setDT(df1)[, ValueInterp:= na.approx(Value, na.rm=TRUE), by = Individual]
like image 30
akrun Avatar answered Oct 23 '22 04:10

akrun