Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Fill in data frame with values from rows above

Tags:

dataframe

r

Say I have a data frame like this:

ID,  ID_2, FIRST, VALUE
-----------------------
'a', 'aa', TRUE, 2
'a', 'ab', FALSE, NA
'a', 'ac', FALSE, NA
'b', 'aa', TRUE, 5
'b', 'ab', FALSE, NA

So VALUE is only set for FIRST = TRUE once per ID. ID_2 may be duplicate between IDs, but doesn't have to.

How do I put the numbers from the first rows of each ID into all rows of that ID, such that the VALUE column becomes 2, 2, 2, 5, 5?

I know I could simply loop over all IDs with a for loop, but I am looking for a more efficient way.

like image 610
Nils Avatar asked May 11 '12 15:05

Nils


People also ask

Which function is used for filling NA value with consecutive values in R?

The fillna() function is used to fill NA/NaN values using the specified method.

How do you add an observation to a data frame?

Adding Single Observation / Row To R Data Frame Create a new Data Frame of the same number of variables/columns. Name the newly created Data Frame variable as of old Data Frame in which you want to add this observation. Use the rbind() function to add a new observation.

How do you expand the data frame by adding rows and columns in data frame for employee data set?

To add more rows permanently to an existing data frame, we need to bring in the new rows in the same structure as the existing data frame and use the rbind() function. In the example below we create a data frame with new rows and merge it with the existing data frame to create the final data frame.


2 Answers

The question asks for efficiency compared with a loop. Here is a comparison of four solutions:

  1. zoo::na.locf, which introduces a package dependency, and although it handles many edge cases, requires that the 'blank' values are NA. The other solutions are easily adapted to non-NA blanks.

  2. A simple loop in base R.

  3. A recursive function in base R.

  4. My own vectorised solution in base R.

  5. The new fill() function in tidyr version 0.3.0., which works on data.frames.

Note that most of these solutions are for vectors, not data frames, so they don't check any ID column. If the data frame isn't grouped by ID, with the value to be filled down being at the top of each group, then you could try a windowing function in dplyr or data.table

# A popular solution
f1 <- zoo::na.locf

# A loop, adapted from https://stat.ethz.ch/pipermail/r-help/2008-July/169199.html
f2 <- function(x) {
  for(i in seq_along(x)[-1]) if(is.na(x[i])) x[i] <- x[i-1]
  x
}

# Recursion, also from https://stat.ethz.ch/pipermail/r-help/2008-July/169199.html
f3 <- function(z) { 
  y <- c(NA, head(z, -1))
  z <- ifelse(is.na(z), y, z)
  if (any(is.na(z))) Recall(z) else z }

# My own effort
f4 <- function(x, blank = is.na) {
  # Find the values
  if (is.function(blank)) {
    isnotblank <- !blank(x)
  } else {
    isnotblank <- x != blank
  }
  # Fill down
  x[which(isnotblank)][cumsum(isnotblank)]
}

# fill() from the `tidyr` version 0.3.0
library(tidyr)
f5 <- function(y) {
  fill(y, column)
}
# Test data, 2600 values, ~58% blanks
x <- rep(LETTERS, 100)
set.seed(2015-09-12)
x[sample(1:2600, 1500)] <- NA
x <- c("A", x) # Ensure the first element is not blank
y <- data.frame(column = x, stringsAsFactors = FALSE) # data.frame version of x for tidyr

# Check that they all work (they do)
identical(f1(x), f2(x))
identical(f1(x), f3(x))
identical(f1(x), f4(x))
identical(f1(x), f5(y)$column)

library(microbenchmark)
microbenchmark(f1(x), f2(x), f3(x), f4(x), f5(y))

Results:

Unit: microseconds
  expr      min        lq       mean    median        uq       max neval
 f1(x)  422.762  466.6355  508.57284  505.6760  527.2540   837.626   100
 f2(x) 2118.914 2206.7370 2501.04597 2312.8000 2497.2285  5377.018   100
 f3(x) 7800.509 7832.0130 8127.06761 7882.7010 8395.3725 14128.107   100
 f4(x)   52.841   58.7645   63.98657   62.1410   65.2655   104.886   100
 f5(y)  183.494  225.9380  305.21337  331.0035  350.4040   529.064   100
like image 166
nacnudus Avatar answered Sep 22 '22 02:09

nacnudus


If you need only to carry forward the values from the VALUE column, then I think you can use na.lofc() function from zoo package. Here is an example:

a<-c(1,NA,NA,2,NA)
na.locf(a)
[1] 1 1 1 2 2
like image 20
Joy Avatar answered Sep 24 '22 02:09

Joy