Transform NA values based on first registration and nearest values

Tags:

na

I already asked a similar question, but this time I only want to restrict how the NA values are replaced.

I have some data like this:

   Date 1   Date 2   Date 3   Date 4   Date 5   Date 6
A  NA       0.1      0.2      NA       0.3      0.2
B  0.1      NA       NA       0.3      0.2      0.1
C  NA       NA       NA       NA       0.3      NA
D  0.1      0.2      0.3      NA       0.1      NA
E  NA       NA       0.1      0.2      0.1      0.3

I would like to replace the NA values in my data based on the first date on which a value is registered. For example, for A the first registration is Date 2. Any NA before that first registration should become 0, and any NA after it should become the mean of the nearest non-NA values on either side (for Date 4 in A, the mean of Date 3 and Date 5).

If the last value is an NA, it should become the last registered value (as in C and D). In the case of E, all NA values become 0 because they all come before the first registration.

I would like to get something like this:

   Date 1   Date 2   Date 3   Date 4   Date 5   Date 6
A  0        0.1      0.2      0.25     0.3      0.2
B  0.1      0.2      0.2      0.3      0.2      0.1
C  0        0        0        0        0.3      0.3
D  0.1      0.2      0.3      0.2      0.1      0.1
E  0        0        0.1      0.2      0.1      0.3

Can you help me? I'm not sure how to do it in R.


asked Jan 10 '19 14:01

user195366

2 Answers

Here is a way using na.approx from the zoo package and apply with MARGIN = 1 (so this is probably not very efficient, but it gets the job done).

library(zoo)
df1 <- as.data.frame(t(apply(dat, 1, na.approx, method = "constant", f = .5, na.rm = FALSE)))

This results in

df1
#   V1  V2  V3   V4  V5
#A  NA 0.1 0.2 0.25 0.3
#B 0.1 0.2 0.2 0.30 0.2
#C  NA  NA  NA   NA 0.3
#E  NA  NA 0.1 0.20 0.1

Replace NAs and rename columns.

df1[is.na(df1)] <- 0
names(df1) <- names(dat)
df1
#  Date_1 Date_2 Date_3 Date_4 Date_5
#A    0.0    0.1    0.2   0.25    0.3
#B    0.1    0.2    0.2   0.30    0.2
#C    0.0    0.0    0.0   0.00    0.3
#E    0.0    0.0    0.1   0.20    0.1

explanation

Given a vector

x <- c(0.1, NA, NA, 0.3, 0.2)
na.approx(x)

returns x with linearly interpolated values

#[1] 0.1000000 0.1666667 0.2333333 0.3000000 0.2000000

But OP asked for constant values so we need the argument method = "constant" from the approx function.

na.approx(x, method = "constant") 
# [1] 0.1 0.1 0.1 0.3 0.2

But this is still not what OP asked for, because it carries the last observation forward whereas OP wants the mean of the closest non-NA values on either side. Therefore we need the argument f (also from approx).

na.approx(x, method = "constant", f = .5)
# [1] 0.1 0.2 0.2 0.3 0.2 # looks good

From ?approx

f : for method = "constant" a number between 0 and 1 inclusive, indicating a compromise between left- and right-continuous step functions. If y0 and y1 are the values to the left and right of the point then the value is y0 if f == 0, y1 if f == 1, and y0*(1-f)+y1*f for intermediate values. In this way the result is right-continuous for f == 0 and left-continuous for f == 1, even for non-finite y values.
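
As a small illustration of the formula quoted above, the following sketch calls base R's approx directly on two known points and evaluates the gap between them for three values of f. The variable names (x_known, y_known, left, mid, right) are made up for this example.

```r
# Two known points: value 0.1 at position 1 and value 0.3 at position 4
x_known <- c(1, 4)
y_known <- c(0.1, 0.3)

# Evaluate the step function at the gap positions 2 and 3
left  <- approx(x_known, y_known, xout = 2:3, method = "constant", f = 0)$y
mid   <- approx(x_known, y_known, xout = 2:3, method = "constant", f = 0.5)$y
right <- approx(x_known, y_known, xout = 2:3, method = "constant", f = 1)$y

# f = 0 takes the left value, f = 1 the right value,
# and f = 0.5 gives 0.1 * (1 - 0.5) + 0.3 * 0.5
print(rbind(left, mid, right))
```

With f = 0.5 both gap positions receive the plain average of the two neighbouring values, which is the behaviour used in the na.approx call above.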

Lastly, if we don't want to replace the NAs at the beginning and end of each row we need na.rm = FALSE.

From ?na.approx

na.rm : logical. If the result of the (spline) interpolation still results in NAs, should these be removed?

data

dat <- structure(list(Date_1 = c(NA, 0.1, NA, NA), Date_2 = c(0.1, NA, 
NA, NA), Date_3 = c(0.2, NA, NA, 0.1), Date_4 = c(NA, 0.3, NA, 
0.2), Date_5 = c(0.3, 0.2, 0.3, 0.1)), .Names = c("Date_1", "Date_2", 
"Date_3", "Date_4", "Date_5"), class = "data.frame", row.names = c("A", 
"B", "C", "E"))

EDIT

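If there are NAs in the last column, we can replace these with the last non-NA value of each row before we apply na.approx as shown above. The code below assumes dat also has a Date_6 column, which the sample data above does not include.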

dat$Date_6[is.na(dat$Date_6)] <- dat[cbind(1:nrow(dat),
                                           max.col(!is.na(dat), ties.method = "last"))][is.na(dat$Date_6)]
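
To see what the max.col step contributes, here is a minimal sketch on a small made-up data frame (demo, with hypothetical values; it is not the dat object from above). It breaks the one-liner into its intermediate pieces.

```r
# Hypothetical three-column example with NAs in the last column
demo <- data.frame(
  Date_4 = c(NA, 0.3, NA),
  Date_5 = c(0.3, 0.2, 0.3),
  Date_6 = c(0.2, NA, NA),
  row.names = c("A", "B", "C")
)

# TRUE where a value is present
observed <- !is.na(demo)

# Column index of the last non-NA value in each row
last_col <- max.col(observed, ties.method = "last")

# Pick that value per row via (row, column) matrix indexing
last_val <- demo[cbind(seq_len(nrow(demo)), last_col)]

# Fill only the missing entries of the last column
missing_last <- is.na(demo$Date_6)
demo$Date_6[missing_last] <- last_val[missing_last]
print(demo)
```

Rows whose last column already holds a value are left untouched, because only the positions flagged in missing_last are overwritten.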


answered Oct 20 '22 09:10

markus

This is another possible answer, using na.locf from the zoo package. Edit: apply is actually not required. This solution also fills in the last observed value when the final value of a row is missing.

# create the dataframe
Date1 <- c(NA,.1,NA,NA)
Date2 <- c(.1, NA,NA,NA)
Date3 <- c(.2,NA,NA,.1)
Date4 <- c(NA,.3,NA,.2)
Date5 <- c(.3,.2,.3,.1)
Date6 <- c(.1,NA,NA,NA)
df <- as.data.frame(cbind(Date1,Date2,Date3,Date4,Date5,Date6))
rownames(df) <- c('A','B','C','D')

> df
  Date1 Date2 Date3 Date4 Date5 Date6
A    NA   0.1   0.2    NA   0.3   0.1
B   0.1    NA    NA   0.3   0.2    NA
C    NA    NA    NA    NA   0.3    NA
D    NA    NA   0.1   0.2   0.1    NA



# Load library
library(zoo)
df2 <- t(na.locf(t(df), na.rm = FALSE)) # last observation carried forward
df3 <- t(na.locf(t(df), na.rm = FALSE, fromLast = TRUE)) # next observation carried backward

df4 <- (df2 + df3) / 2 # mean of both matrices; NA wherever either side is still missing

df4 <- t(na.locf(t(df4), na.rm = FALSE)) # fill trailing NAs with the last value
df4[is.na(df4)] <- 0 # remaining (leading) NA values become 0

  Date1 Date2 Date3 Date4 Date5 Date6
A   0.0   0.1   0.2  0.25   0.3   0.1
B   0.1   0.2   0.2  0.30   0.2   0.2
C   0.0   0.0   0.0  0.00   0.3   0.3
D   0.0   0.0   0.1  0.20   0.1   0.1
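
For comparison, the three rules from the question can also be written out explicitly in base R, without zoo. This is only a sketch: fill_row and dat6 are names invented for this example, and dat6 is the six-column data from the question typed in by hand. It is slower than the vectorised approaches above but makes each rule visible.

```r
# Hypothetical helper: apply the three rules to one row
fill_row <- function(x) {
  idx <- which(!is.na(x))                 # positions with a registered value
  if (length(idx) == 0) return(rep(0, length(x)))
  out <- x
  for (i in which(is.na(x))) {
    left  <- idx[idx < i]                 # registered positions before i
    right <- idx[idx > i]                 # registered positions after i
    if (length(left) == 0) {
      out[i] <- 0                         # before the first registration
    } else if (length(right) == 0) {
      out[i] <- x[max(left)]              # trailing NA: last registered value
    } else {
      out[i] <- (x[max(left)] + x[min(right)]) / 2  # mean of nearest values
    }
  }
  out
}

# The data from the question, rows A to E
dat6 <- data.frame(
  Date_1 = c(NA, 0.1, NA, 0.1, NA),
  Date_2 = c(0.1, NA, NA, 0.2, NA),
  Date_3 = c(0.2, NA, NA, 0.3, 0.1),
  Date_4 = c(NA, 0.3, NA, NA, 0.2),
  Date_5 = c(0.3, 0.2, 0.3, 0.1, 0.1),
  Date_6 = c(0.2, 0.1, NA, NA, 0.3),
  row.names = c("A", "B", "C", "D", "E")
)

filled <- as.data.frame(t(apply(dat6, 1, fill_row)))
print(filled)
```

On this input the result should agree with the expected table in the question, for example 0.25 for Date 4 in row A and 0.3 for Date 6 in row C.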

answered Oct 20 '22 10:10

Niek
