
Applying custom function to each row uses only first value of argument

I am trying to recode NA values to 0 in a subset of columns using the following dataset:

set.seed(1)
df <- data.frame(
  id = c(1:10),
  trials = sample(1:3, 10, replace = TRUE),
  t1 = c(sample(c(1:9, NA), 10)),
  t2 = c(sample(c(1:7, rep(NA, 3)), 10)),
  t3 = c(sample(c(1:5, rep(NA, 5)), 10))
  )

Each row has a certain number of trials associated with it (between 1 and 3), specified by the trials column. Columns t1-t3 hold the scores for each trial.

The number of trials indicates the subset of columns in which NAs should be recoded to 0: NAs that are within the number of trials represent missing data, and should be recoded as 0, while NAs outside the number of trials are not meaningful, and should remain NAs. So, for a row where trials == 3, an NA in column t3 would be recoded as 0, but in a row where trials == 2, an NA in t3 would remain an NA.

So, I tried using this function:

replace0 <- function(x, num.sun) {
  x[which(is.na(x[1:(num.sun + 2)]))] <- 0
  return(x)
}

This works well for single vectors. When I try applying the same function to a data frame with apply(), though:

apply(df, 1, replace0, num.sun = df$trials)

I get a warning saying:

In 1:(num.sun + 2) :
  numerical expression has 10 elements: only the first used

The result is that instead of having the value of num.sun change every row according to the value in trials, apply() simply uses the first value in the trials column for every single row. How could I apply the function so that the num.sun argument changes according to the value of df$trials?
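
The warning can be reproduced without the data frame. The sketch below is illustrative only (the names n, m and seen are not from the original code): apply() forwards extra arguments unchanged to every call, so each row receives the whole trials vector, and the ':' operator then uses only its first element.

```r
# Illustrative stand-in for df$trials
n <- c(1, 3, 2)

# ':' needs a single end point, so it takes only n[1] + 2 and warns;
# the warning is suppressed here to keep the output clean
idx <- suppressWarnings(1:(n + 2))
idx   # 1 2 3

# Extra arguments given to apply() are not split by row:
# every one of the three calls sees all three elements of 'extra'
m <- matrix(1:6, nrow = 3)
seen <- apply(m, 1, function(row, extra) length(extra), extra = n)
seen  # 3 3 3
```

So the per-row value has to come from the row itself (it is already there as the trials element) or from a function that iterates over several vectors in parallel.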

Thanks!

Edit: as some have commented, the original example data had some non-NA scores that didn't make sense according to the trials column. Here's a corrected dataset:

df <- data.frame(
  id = c(1:5),
  trials = c(rep(1, 2), rep(2, 1), rep(3, 2)),
  t1 = c(NA, 7, NA, 6, NA),
  t2 = c(NA, NA, 3, 7, 12),
  t3 = c(NA, NA, NA, 4, NA)
)
asked Sep 02 '18 07:09 by Aziggy


2 Answers

Another approach:

# create an index of the NA values
w <- which(is.na(df), arr.ind = TRUE)

# create an index with the max column by row where an NA is allowed to be replaced by a zero
m <- matrix(c(1:nrow(df), (df$trials + 2)), ncol = 2)

# subset 'w' such that only the NA's which fall in the scope of 'm' remain
# (drop = FALSE keeps 'i' a two-column matrix even when only one NA qualifies)
i <- w[w[,2] <= m[,2][match(w[,1], m[,1])], , drop = FALSE]

# use 'i' to replace the allowed NA's with a zero
df[i] <- 0

which gives:

> df
   id trials t1 t2 t3
1   1      1  3 NA  5
2   2      2  2  2 NA
3   3      2  6  6  4
4   4      3  0  1  2
5   5      1  5 NA NA
6   6      3  7  0  0
7   7      3  8  7  0
8   8      2  4  5  1
9   9      2  1  3 NA
10 10      1  9  4  3

You could easily wrap this in a function:

replace.NA.with.0 <- function(df) {
  w <- which(is.na(df), arr.ind = TRUE)
  m <- matrix(c(1:nrow(df), (df$trials + 2)), ncol = 2)
  i <- w[w[,2] <= m[,2][match(w[,1], m[,1])], , drop = FALSE]
  df[i] <- 0
  return(df)
}

Now, using replace.NA.with.0(df) will produce the above result.
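
The same idea can also be written as a logical mask. The following is a minimal sketch rather than part of the function above; it assumes the score columns are named t1, t2 and t3 (held in the hypothetical vector score_cols) and uses the corrected dataset from the question's edit.

```r
# Corrected example data from the question
df <- data.frame(
  id = 1:5,
  trials = c(1, 1, 2, 3, 3),
  t1 = c(NA, 7, NA, 6, NA),
  t2 = c(NA, NA, 3, 7, 12),
  t3 = c(NA, NA, NA, 4, NA)
)

score_cols <- c("t1", "t2", "t3")   # assumed names of the score columns
scores <- as.matrix(df[score_cols])

# col(scores) holds each cell's column number (1 for t1, 2 for t2, ...);
# df$trials is recycled down the rows, so each row is compared with its own value
in_scope <- col(scores) <= df$trials

# replace only the NAs that fall within the row's number of trials
df[score_cols][is.na(scores) & in_scope] <- 0
df
#   id trials t1 t2 t3
# 1  1      1  0 NA NA
# 2  2      1  7 NA NA
# 3  3      2  0  3 NA
# 4  4      3  6  7  4
# 5  5      3  0 12  0
```

Because the mask is built from column positions within the score columns, no offset for the id and trials columns is needed.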


As noted by others, some rows (1, 3 & 10) have more values than trials. You could tackle that problem by rewriting the above function to:

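replace.with.NA.or.0 <- function(df) {
  # set every NA to zero first
  w <- which(is.na(df), arr.ind = TRUE)
  df[w] <- 0

  # row index and last column in which a score is allowed; defined here
  # so the function does not depend on a global 'm'
  m <- matrix(c(1:nrow(df), (df$trials + 2)), ncol = 2)

  # for each row, the columns after the allowed one (up to column 5) become NA
  v <- tapply(m[,2], m[,1], FUN = function(x) tail(x:5,-1))
  ina <- matrix(as.integer(unlist(stack(v)[2:1])), ncol = 2)
  df[ina] <- NA

  return(df)
}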

Now, using replace.with.NA.or.0(df) produces the following result:

   id trials t1 t2 t3
1   1      1  3 NA NA
2   2      2  2  2 NA
3   3      2  6  6 NA
4   4      3  0  1  2
5   5      1  5 NA NA
6   6      3  7  0  0
7   7      3  8  7  0
8   8      2  4  5 NA
9   9      2  1  3 NA
10 10      1  9 NA NA
answered Sep 28 '22 16:09 by Jaap


Here I just rewrite your function using double subsetting, x[paste0('t', x['trials'])], which overcomes the problem in the other two solutions with row 6:

replace0 <- function(x) {
  # name of the column for this row's last trial, e.g. 't3' when trials == 3
  col_name <- paste0('t', x['trials'])
  if (is.na(x[col_name])) {
    x[col_name] <- 0
  }
  return(x)
}

t(apply(df, 1, replace0))

      id trials t1 t2 t3
 [1,]  1      1  3 NA  5
 [2,]  2      2  2  2 NA
 [3,]  3      2  6  6  4
 [4,]  4      3 NA  1  2
 [5,]  5      1  5 NA NA
 [6,]  6      3  7 NA  0
 [7,]  7      3  8  7  0
 [8,]  8      2  4  5  1
 [9,]  9      2  1  3 NA
[10,] 10      1  9  4  3
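
Note that the function above checks only the single column named by trials (t3 when trials is 3), so an NA in an earlier trial column stays NA, as rows 4 and 6 of the output show. If every NA within the first trials score columns should become 0, as the question describes, a row-wise variant might look like the sketch below. The name replace0_upto and its offset argument are hypothetical, and the sketch assumes the score columns directly follow the first two columns (id and trials); it uses the corrected dataset from the question's edit.

```r
# Corrected example data from the question
df <- data.frame(
  id = 1:5,
  trials = c(1, 1, 2, 3, 3),
  t1 = c(NA, 7, NA, 6, NA),
  t2 = c(NA, NA, 3, 7, 12),
  t3 = c(NA, NA, NA, 4, NA)
)

# Hypothetical helper: x is one row, passed by apply() as a named numeric vector.
# 'offset' is the number of non-score columns before t1 (assumed to be 2).
replace0_upto <- function(x, offset = 2) {
  n <- x[["trials"]]                       # this row's own number of trials
  in_scope <- offset + seq_len(n)          # positions of t1..t<n> within the row
  hits <- in_scope[is.na(x[in_scope])]     # in-scope positions that hold NA
  x[hits] <- 0
  x
}

# apply() returns one column per row, so transpose back
res <- t(apply(df, 1, replace0_upto))
res
```

Since the number of trials is read from the row itself, no extra argument has to be passed through apply(), which avoids the "only the first used" warning. The result is a numeric matrix; as.data.frame(res) would turn it back into a data frame.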
answered Sep 28 '22 17:09 by A. Suliman