I would like to perform a linear interpolation in a variable of a data frame which takes into account the: 1) time difference between the two points, 2) the moment when the data was taken and 3) the individual taken for measure the variable.
For example in the next dataframe:
df <- data.frame(time=c(1,2,3,4,5,6,7,1,2,3),
Individuals=c(1,1,1,1,1,1,1,2,2,2),
Value=c(1, 2, 3, NA, 5, NA, 7, 5, NA, 7))
df
I would like to obtain:
result <- data.frame(time=c(1,2,3,4,5,6,7,1,2,3),
Individuals=c(1,1,1,1,1,1,1,2,2,2),
Value=c(1, 2, 3, 4, 5, 6, 7, 5, 5.5, 6))
result
I cannot use exclusively the function na.approx
of the package zoo
because all observations are not consecutives, some observations belong to one individual and other observations belong to other ones. The reason is because if the second individual would have its first obsrevation with NA
and I would use exclusively the function na.approx
, I would be using information from the individual==1
to interpolate the NA
of the individual==2
(e.g the next data frame would have sucherror)
df_2 <- data.frame(time=c(1,2,3,4,5,6,7,1,2,3),
Individuals=c(1,1,1,1,1,1,1,2,2,2),
Value=c(1, 2, 3, NA, 5, NA, 7, NA, 5, 7))
df_2
I have tried using the packages zoo
and dplyr
:
library(dplyr)
library(zoo)
proof <- df %>%
group_by(Individuals) %>%
na.approx(df$Value)
But I cannot perform group_by
in a zoo
object.
Do you know how to interpolate NA
values in one variable by groups?
Thanks in advance,
na.interpolation is replaced by na_interpolation . The functionality stays the same. The new name better fits modern R code style guidelines (which prefer _ over . in function names). na.interpolation (x, option = "linear", maxgap = Inf, ...) Numeric Vector ( vector) or Time Series ( ts ) object in which missing values shall be replaced
This is especially true in the geosciences, where sampling might require hundreds of thousands of Euros for drilling deep holes to the formation of your interest. Doing your geospatial interpolation in R is a great opportunity.
"linear" - for linear interpolation using approx (default choice) Maximum number of successive NAs to still perform imputation on. Default setting is to replace all NAs without restrictions. With this option set, consecutive NAs runs, that are longer than 'maxgap' will be left NA.
Default setting is to replace all NAs without restrictions. With this option set, consecutive NAs runs, that are longer than 'maxgap' will be left NA. This option mostly makes sense if you want to treat long runs of NA afterwards separately. ... Vector ( vector) or Time Series ( ts ) object (dependent on given input at parameter x)
Use data.frame
, rather than cbind
to create your data. cbind
returns a matrix, but you need a data frame for dplyr
. Then use na.approx
inside mutate
. I've commented out group_by
, as you haven't provided the grouping variable in your data, but the approach should work once you've added the grouping variable to the data frame.
df <- data.frame(time=c(1,2,3,4,5,6,7,1,2,3),
Individuals=c(1,1,1,1,1,1,1,2,2,2),
Value=c(NA, 2, 3, NA, 5, NA, 7, 8, NA, 10))
library(dplyr)
library(zoo)
df %>%
group_by(Individuals) %>%
mutate(ValueInterp = na.approx(Value, na.rm=FALSE))
time Individuals Value ValueInterp 1 1 1 NA NA 2 2 1 2 2 3 3 1 3 3 4 4 1 NA 4 5 5 1 5 5 6 6 1 NA 6 7 7 1 7 7 8 1 2 8 8 9 2 2 NA 9 10 3 2 10 10
Update: To interpolate multiple columns, we can use mutate_at
. Here's an example with two value columns. We use mutate_at
to run na.approx
on all columns that include "Value"
in the column name. list(interp=na.approx)
tells mutate_at
to generate new column names by running na.approx
and adding interp
as a suffix to generate the new column names:
df <- data.frame(time=c(1,2,3,4,5,6,7,1,2,3),
Individuals=c(1,1,1,1,1,1,1,2,2,2),
Value1=c(NA, 2, 3, NA, 5, NA, 7, 8, NA, 10),
Value2=c(NA, 2, 3, NA, 5, NA, 7, 8, NA, 10)*2)
df %>%
group_by(Individuals) %>%
mutate_at(vars(matches("Value")), list(interp=na.approx), na.rm=FALSE)
time Individuals Value1 Value2 Value1_interp Value2_interp <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> 1 1 1 NA NA NA NA 2 2 1 2 4 2 4 3 3 1 3 6 3 6 4 4 1 NA NA 4 8 5 5 1 5 10 5 10 6 6 1 NA NA 6 12 7 7 1 7 14 7 14 8 1 2 8 16 8 16 9 2 2 NA NA 9 18 10 3 2 10 20 10 20
If you don't want to preserve the original, uninterpolated columns, you can do:
df %>%
group_by(Individuals) %>%
mutate_at(vars(matches("Value")), na.approx, na.rm=FALSE)
We can use data.table
library(data.table)
library(zoo)
setDT(df1)[, ValueInterp:= na.approx(Value, na.rm=TRUE), by = Individual]
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With