Fill in data frame with values from rows above

Tags: dataframe, r

Say I have a data frame like this:

ID,  ID_2, FIRST, VALUE
-----------------------
'a', 'aa', TRUE, 2
'a', 'ab', FALSE, NA
'a', 'ac', FALSE, NA
'b', 'aa', TRUE, 5
'b', 'ab', FALSE, NA

So VALUE is only set for FIRST = TRUE once per ID. ID_2 may be duplicate between IDs, but doesn't have to.

How do I put the numbers from the first rows of each ID into all rows of that ID, such that the VALUE column becomes 2, 2, 2, 5, 5?

I know I could simply loop over all IDs with a for loop, but I am looking for a more efficient way.


asked May 11 '12 15:05

Nils

2 Answers

The question asks for efficiency compared with a loop. Here is a comparison of five solutions:

1. zoo::na.locf, which introduces a package dependency and, although it handles many edge cases, requires that the 'blank' values are NA. The other solutions are easily adapted to non-NA blanks.
2. A simple loop in base R.
3. A recursive function in base R.
4. My own vectorised solution in base R.
5. The new fill() function in tidyr version 0.3.0, which works on data.frames.

Note that most of these solutions operate on vectors, not data frames, so they ignore any ID column. They give the right answer only when the rows of each ID are contiguous and the value to be filled down sits at the top of each group. If the data frame isn't arranged that way, you could try a windowing function in dplyr or data.table.
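As a rough illustration of the grouped case, here is a minimal sketch that fills VALUE within each ID for the data frame from the question. It uses base R's ave() rather than dplyr or data.table, purely to avoid a package dependency; the helper name first_known is made up for this example.

```r
# Data frame from the question (VALUE is set only where FIRST is TRUE).
df <- data.frame(
  ID    = c("a", "a", "a", "b", "b"),
  ID_2  = c("aa", "ab", "ac", "aa", "ab"),
  FIRST = c(TRUE, FALSE, FALSE, TRUE, FALSE),
  VALUE = c(2, NA, NA, 5, NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: the first non-NA element of a vector
# (NA if the whole vector is NA).
first_known <- function(v) v[!is.na(v)][1]

# ave() applies the helper within each ID and recycles the single
# result over every row of that group, so row order does not matter.
df$VALUE <- ave(df$VALUE, df$ID, FUN = first_known)
print(df$VALUE)
```

Because the lookup is done per ID, this should also work when the rows of one ID are scattered through the data frame, which the vector-based functions below do not handle.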

# A popular solution
f1 <- zoo::na.locf

# A loop, adapted from https://stat.ethz.ch/pipermail/r-help/2008-July/169199.html
f2 <- function(x) {
  for(i in seq_along(x)[-1]) if(is.na(x[i])) x[i] <- x[i-1]
  x
}

# Recursion, also from https://stat.ethz.ch/pipermail/r-help/2008-July/169199.html
f3 <- function(z) {
  y <- c(NA, head(z, -1))
  z <- ifelse(is.na(z), y, z)
  if (any(is.na(z))) Recall(z) else z
}

# My own effort
f4 <- function(x, blank = is.na) {
  # Find the values
  if (is.function(blank)) {
    isnotblank <- !blank(x)
  } else {
    isnotblank <- x != blank
  }
  # Fill down
  x[which(isnotblank)][cumsum(isnotblank)]
}

# fill() from the `tidyr` version 0.3.0
library(tidyr)
f5 <- function(y) {
  fill(y, column)
}

# Test data: 2600 values, 1500 of them (~58%) blanked, plus a non-blank first element
x <- rep(LETTERS, 100)
set.seed(2015-09-12) # evaluated as arithmetic, so the seed is 1994
x[sample(1:2600, 1500)] <- NA
x <- c("A", x) # Ensure the first element is not blank
y <- data.frame(column = x, stringsAsFactors = FALSE) # data.frame version of x for tidyr

# Check that they all work (they do)
identical(f1(x), f2(x))
identical(f1(x), f3(x))
identical(f1(x), f4(x))
identical(f1(x), f5(y)$column)

library(microbenchmark)
microbenchmark(f1(x), f2(x), f3(x), f4(x), f5(y))

Results:

Unit: microseconds
  expr      min        lq       mean    median        uq       max neval
 f1(x)  422.762  466.6355  508.57284  505.6760  527.2540   837.626   100
 f2(x) 2118.914 2206.7370 2501.04597 2312.8000 2497.2285  5377.018   100
 f3(x) 7800.509 7832.0130 8127.06761 7882.7010 8395.3725 14128.107   100
 f4(x)   52.841   58.7645   63.98657   62.1410   65.2655   104.886   100
 f5(y)  183.494  225.9380  305.21337  331.0035  350.4040   529.064   100
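To make the cumsum() trick behind f4 easier to follow, here is a step-by-step sketch on a tiny vector. It also shows one caveat: when the first element is blank, the index starts at 0 and R silently drops those positions. The guarded variant fill_down is a hypothetical helper written for this illustration, not part of the benchmark above.

```r
x <- c("A", NA, NA, "B", NA, "C")

known  <- !is.na(x)      # TRUE FALSE FALSE TRUE FALSE TRUE
idx    <- cumsum(known)  # 1 1 1 2 2 3: which known value each position takes
values <- x[known]       # "A" "B" "C"
filled <- values[idx]    # "A" "A" "A" "B" "B" "C"

# Caveat: a leading blank yields an index of 0, and indexing with 0
# returns nothing, so the result is shorter than the input.
lead  <- c(NA, "A", NA)
short <- lead[!is.na(lead)][cumsum(!is.na(lead))]  # length 2, not 3

# Hypothetical guarded variant: turn the 0 indices into NA so that
# leading blanks stay blank and the length is preserved.
fill_down <- function(x, blank = is.na) {
  known <- !blank(x)
  idx <- cumsum(known)
  idx[idx == 0] <- NA
  x[known][idx]
}

print(fill_down(lead))                                            # NA "A" "A"
print(fill_down(c("A", "", "B", ""), blank = function(v) v == ""))  # "A" "A" "B" "B"
```

The second call shows how the same idea adapts to non-NA blanks by passing a different predicate.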


answered Sep 22 '22 02:09

nacnudus

If you only need to carry forward the values in the VALUE column, then I think you can use the na.locf() function from the zoo package. Here is an example:

library(zoo)
a <- c(1, NA, NA, 2, NA)
na.locf(a)
# [1] 1 1 1 2 2
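One thing to watch when assigning the result back into a data frame column: by default na.locf() drops leading NAs, so the output can be shorter than the input. The sketch below, which assumes the zoo package is installed and uses a made-up vector, contrasts the default with na.rm = FALSE.

```r
# Runs only if zoo is available, so the snippet is safe to source anywhere.
if (requireNamespace("zoo", quietly = TRUE)) {
  value <- c(NA, 2, NA, NA, 5, NA)

  # Default (na.rm = TRUE): the leading NA is removed, leaving 5 elements.
  dropped <- zoo::na.locf(value)

  # na.rm = FALSE keeps the leading NA, so the length stays at 6 and the
  # result can be assigned straight back to a column of the same length.
  kept <- zoo::na.locf(value, na.rm = FALSE)

  print(dropped)
  print(kept)
}
```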

answered Sep 24 '22 02:09

Joy
