
Problem with data frame transformation using dplyr package

Tags:

dataframe

r

dplyr

Problem

Let's consider two data frames: the first contains only 1's and 0's, and the second contains the data:

set.seed(20)
df<-data.frame(sample(0:1,5,T),sample(0:1,5,T),sample(0:1,5,T))

#zero_one data frame
  sample.0.1..5..T. sample.0.1..5..T..1 sample.0.1..5..T..2
1                 0                   1                   0
2                 1                   0                   0
3                 1                   1                   1
4                 0                   0                   0
5                 1                   0                   1
df1<-data.frame(append(rnorm(4),10),append(runif(4),-5),append(rexp(4),20))

#with data
  append.rnorm.4...10. append.runif.4....5. append.rexp.4...20.
1           0.08609139            0.2374272           0.3341095
2          -0.63778176            0.2297862           0.7537732
3           0.22642990            0.9447793           1.3011998
4          -0.05418293            0.8448115           1.2097271
5          10.00000000           -5.0000000          20.0000000

What I want to do is replace the values in the second data frame at positions where the first data frame is 0 with the mean of the values in the same column at positions where the first data frame is 1.

Example

In the first column I want to replace 0.08609139 and -0.05418293 (the values for which the first column of the first data frame is 0) with mean(c(-0.63778176, 0.22642990, 10.00000000)) (the values for which the first column of the first data frame is 1).
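As a quick check of this example, the target for the first column can be computed by hand from the printed values. This is only a sketch: `mask`, `vals`, `fill` and `expected` are hypothetical names, and the numbers are copied from the output shown above. The values are wrapped in c() because mean() averages only its first argument.

```r
# First column of each data frame, copied from the printed output
mask <- c(0, 1, 1, 0, 1)
vals <- c(0.08609139, -0.63778176, 0.22642990, -0.05418293, 10)

# Mean of the values whose mask entry is 1 (roughly 3.196)
fill <- mean(vals[mask == 1])

# Keep values where mask is 1, use the mean where mask is 0
expected <- ifelse(mask == 0, fill, vals)
```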

I want to do it using the mutate_all() function from the dplyr package.

My work so far

  df1<-df1 %>% mutate_all(
      function(x) ifelse(df[x]==0, mean(x[df==1],na.rm=T,x)))

I know that the condition df[x] is meaningless, but I have no idea what I should put there. Could you please help me with that?

John asked Dec 10 '25 11:12


1 Answer

You could follow @deschen's suggestion and multiply the two data frames together.
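The original suggestion is not shown here, so the following is only one possible reading of "multiplying the data frames together". Multiplying elementwise by the 0/1 mask zeroes out the values to be replaced, which gives the column means of the kept values; the same mask then blends kept values with those means. The names `mask`, `vals`, `m`, `v`, `col_means`, `fill` and `result`, and the toy data, are all assumptions made for illustration.

```r
# Toy data: mask holds 0/1 flags, vals holds the numbers to adjust
mask <- data.frame(a = c(0, 1, 1, 0, 1), b = c(1, 0, 1, 0, 0))
vals <- data.frame(a = c(1, 2, 4, 8, 6), b = c(10, 25, 30, 40, 50))

m <- as.matrix(mask)
v <- as.matrix(vals)

# Elementwise product keeps values where the mask is 1 and zeroes the rest,
# so dividing the column sums by the count of 1's gives the wanted means.
# A column whose mask is all zeroes would give NaN here (0 / 0).
col_means <- colSums(m * v) / colSums(m)

# Repeat each column mean down its column
fill <- matrix(col_means, nrow = nrow(v), ncol = ncol(v), byrow = TRUE)

# Where mask is 1 keep the value, where mask is 0 take the column mean
result <- as.data.frame(m * v + (1 - m) * fill)
```

This stays fully vectorised, but it relies on the mask containing only 0's and 1's.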

Here is another approach to consider, using mapply. For each column, identify the positions (indices) in df where the value is zero.

Then replace the values at those positions in the corresponding df1 column with the mean of the other values in that column. y[-idx] selects all values in the df1 column except those positions.

Note that I use a different seed: with your set.seed(20) I got different values from the ones you posted, including a column of all zeroes. Please let me know if you are able to reproduce your output.

set.seed(12)

df<-data.frame(sample(0:1,5,T),sample(0:1,5,T),sample(0:1,5,T))
df1<-data.frame(append(rnorm(4),10),append(runif(4),-5),append(rexp(4),20))

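# x: a 0/1 column from df; y: the matching column from df1
my_fun <- function(x, y) {
  idx <- which(x == 0)       # positions where the indicator is zero
  y[idx] <- mean(y[-idx])    # overwrite them with the mean of the remaining values
  return(y)
}

# Apply my_fun to each pair of columns; the result is a matrix, one column per pair
mapply(my_fun, df, df1)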
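Because mapply simplifies its result to a matrix, a variant that keeps the data frame structure may be more convenient. The sketch below uses base R's Map and a logical mask instead of negative indexing; `replace_zeros_with_mean`, `mask`, `vals` and `result` are hypothetical names, and the toy data is made up for illustration.

```r
# Replace entries flagged 0 with the mean of the entries flagged 1.
# If a column has no 1's, the mean is NaN and the whole column becomes NaN.
replace_zeros_with_mean <- function(mask_col, value_col) {
  keep <- mask_col == 1
  value_col[!keep] <- mean(value_col[keep])
  value_col
}

# Toy data: mask holds 0/1 flags, vals holds the numbers to adjust
mask <- data.frame(a = c(0, 1, 1, 0, 1), b = c(1, 0, 1, 0, 0))
vals <- data.frame(a = c(1, 2, 4, 8, 6), b = c(10, 25, 30, 40, 50))

# Map pairs the columns by position; assigning into result[] keeps
# the data frame class and the column names of vals
result <- vals
result[] <- Map(replace_zeros_with_mean, mask, vals)
```

Map matches columns by position, not by name, so both data frames need their columns in the same order.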
Ben answered Dec 12 '25 02:12


