Say I have a data frame like this:
ID, ID_2, FIRST, VALUE
-----------------------
'a', 'aa', TRUE, 2
'a', 'ab', FALSE, NA
'a', 'ac', FALSE, NA
'b', 'aa', TRUE, 5
'b', 'ab', FALSE, NA
So VALUE is only set for FIRST = TRUE once per ID. ID_2 may be duplicate between IDs, but doesn't have to.
How do I put the numbers from the first rows of each ID into all rows of that ID, such that the VALUE column becomes 2, 2, 2, 5, 5?
I know I could simply loop over all IDs with a for loop, but I am looking for a more efficient way.
The fillna() function is used to fill NA/NaN values using the specified method.
Adding Single Observation / Row To R Data Frame Create a new Data Frame of the same number of variables/columns. Name the newly created Data Frame variable as of old Data Frame in which you want to add this observation. Use the rbind() function to add a new observation.
To add more rows permanently to an existing data frame, we need to bring in the new rows in the same structure as the existing data frame and use the rbind() function. In the example below we create a data frame with new rows and merge it with the existing data frame to create the final data frame.
The question asks for efficiency compared with a loop. Here is a comparison of four solutions:
zoo::na.locf
, which introduces a package dependency, and although it handles many edge cases, requires that the 'blank' values are NA. The other solutions are easily adapted to non-NA blanks.
A simple loop in base R.
A recursive function in base R.
My own vectorised solution in base R.
The new fill()
function in tidyr
version 0.3.0., which works on data.frames.
Note that most of these solutions are for vectors, not data frames, so they don't check any ID column. If the data frame isn't grouped by ID, with the value to be filled down being at the top of each group, then you could try a windowing function in dplyr
or data.table
# A popular solution
f1 <- zoo::na.locf
# A loop, adapted from https://stat.ethz.ch/pipermail/r-help/2008-July/169199.html
f2 <- function(x) {
for(i in seq_along(x)[-1]) if(is.na(x[i])) x[i] <- x[i-1]
x
}
# Recursion, also from https://stat.ethz.ch/pipermail/r-help/2008-July/169199.html
f3 <- function(z) {
y <- c(NA, head(z, -1))
z <- ifelse(is.na(z), y, z)
if (any(is.na(z))) Recall(z) else z }
# My own effort
f4 <- function(x, blank = is.na) {
# Find the values
if (is.function(blank)) {
isnotblank <- !blank(x)
} else {
isnotblank <- x != blank
}
# Fill down
x[which(isnotblank)][cumsum(isnotblank)]
}
# fill() from the `tidyr` version 0.3.0
library(tidyr)
f5 <- function(y) {
fill(y, column)
}
# Test data, 2600 values, ~58% blanks
x <- rep(LETTERS, 100)
set.seed(2015-09-12)
x[sample(1:2600, 1500)] <- NA
x <- c("A", x) # Ensure the first element is not blank
y <- data.frame(column = x, stringsAsFactors = FALSE) # data.frame version of x for tidyr
# Check that they all work (they do)
identical(f1(x), f2(x))
identical(f1(x), f3(x))
identical(f1(x), f4(x))
identical(f1(x), f5(y)$column)
library(microbenchmark)
microbenchmark(f1(x), f2(x), f3(x), f4(x), f5(y))
Results:
Unit: microseconds
expr min lq mean median uq max neval
f1(x) 422.762 466.6355 508.57284 505.6760 527.2540 837.626 100
f2(x) 2118.914 2206.7370 2501.04597 2312.8000 2497.2285 5377.018 100
f3(x) 7800.509 7832.0130 8127.06761 7882.7010 8395.3725 14128.107 100
f4(x) 52.841 58.7645 63.98657 62.1410 65.2655 104.886 100
f5(y) 183.494 225.9380 305.21337 331.0035 350.4040 529.064 100
If you need only to carry forward the values from the VALUE column, then I think you can use na.lofc()
function from zoo package. Here is an example:
a<-c(1,NA,NA,2,NA)
na.locf(a)
[1] 1 1 1 2 2
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With