Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Filtering data frame based on values in second data frame

I have 2 data frames:

at1 = data.frame(ID = c("A", "B", "C", "D", "E"), Sample1 = rnorm(5, 50000, 2500),
      Sample2 = rnorm(5, 50000, 2500), Sample3 = rnorm(5, 50000, 2500),
      row.names = "ID")

  Sample1  Sample2  Sample3
A 52626.55 51924.51 50919.90
B 51430.51 49100.38 51005.92
C 50038.27 52254.73 50014.78
D 48644.46 53926.53 51590.05
E 46462.01 45097.48 50963.39

bt1 = data.frame(ID = c("A", "B", "C", "D", "E"), Sample1 = c(0,1,1,1,1),
      Sample2 = c(0,0,0,1,0), Sample3 = c(1,0,1,1,0), 
      row.names = "ID")

   Sample1 Sample2 Sample3
A       0       0       1
B       1       0       0
C       1       0       1
D       1       1       1
E       1       0       0

I would like to filter every cell in at1 based on the value in the corresponding cell in bt1 (0 or 1) and have the result stored in a new data frame ct1. For instance, if bt1[1, "Sample1"] = 1 then ct1[1, "Sample1"] = at1[1, "Sample1"]. If bt1[1, "Sample1"] = 0 then ct1[1, "Sample1"] = 0. My original data frames have more than 100 columns and more than 30,000 rows.

I was wondering if there is an easier way than writing if-loops (e.g. using "apply"?).

like image 231
Dalmuti71 Avatar asked Aug 16 '12 23:08

Dalmuti71


2 Answers

Here is a data.table solution, and a second simplistic solution

note that I have made ID a specific column in the data.frame not the row.names for ideological and pratical reasons

  • a data.table does not have rownames
  • I think it is easier to consider them as part of the data

library(data.table)
library(reshape2)

bt1 <- data.frame(ID = c("A", "B", "C", "D", "E"), Sample1 = c(0,1,1,1,1),
   Sample2 = c(0,0,0,1,0), Sample3 = c(1,0,1,1,0))

at1 <- data.frame(ID = c("A", "B", "C", "D", "E"), Sample1 = rnorm(5, 50000, 2500),
  Sample2 = rnorm(5, 50000, 2500), Sample3 = rnorm(5, 50000, 2500))

# place in long form
at_long <- data.table(melt(at1, id.var = 1))
bt_long <- data.table(melt(bt1, value.name = 'bt_value', id.var = 1))
# set keys for easy merging with data.tabl
setkeyv(at_long, c('ID','variable'))
setkeyv(bt_long, c('ID','variable'))
# merge
combined <- at_long[bt_long]
# set those where 'bt_value == 0' as 0
set(combined, which(combined[['bt_value']]==0), 'value',0)
# or (using the fact that the `bt` data is only 0 or 1
combined[value := value * bt_value]
# then reshape to wide format
dcast(combined, ID~variable, value.var = 'value')
##   ID  Sample1  Sample2  Sample3
## 1  A     0.00     0.00 50115.24
## 2  B 50173.16     0.00     0.00
## 3  C 48216.31     0.00 51952.30
## 4  D 52387.53 50889.95 44043.66
## 5  E 50982.56     0.00     0.00

A second, simplistic approach

If you know that the row orders are the same in bt1 and at1 (your data sets), you can simply multiply the appropriate components of the data.frames (* works element-wise)

sample_cols <- paste0('Sample',1:3)
at1[,sample_cols] * bt1[,sample_cols]

##    Sample1  Sample2  Sample3
## 1     0.00     0.00 50115.24
## 2 50173.16     0.00     0.00
## 3 48216.31     0.00 51952.30
## 4 52387.53 50889.95 44043.66
## 5 50982.56     0.00     0.00

which you could cbind to the ID from at1 or bt1 or if kept the ID as row.names, then the row.names would persist.

like image 191
mnel Avatar answered Oct 16 '22 15:10

mnel


You can use vectorisation (amongst other things).

For example:

ct1 <- at1                           # set ct1 equal to at1
ct1$Sample1[bt1$Sample1 == 0] <- 0   # if bt1$Sample1 = 0, set the value to 0

For the second line: bt1$Sample1 == 0 is a logical vector that is TRUE if bt1$Sample1 is 0, and then we use it as an index into ct1 in order to set those values to 0. Since ct1 is initialised to at1, all the other rows (where bt1$Sample1 == 1) are set to the value in at1.

Another way to do this could be using ifelse, which is a vectorised form of an if statement:

ct1$Sample1 <- ifelse(bt1$Sample1 == 0, 0, at1$Sample1)

It means "for each row in bt1$Sample1, if bt1$Sample1[row] == 0 substitute a 0, and otherwise substitute at1$Sample1[row].

You can repeat this for each column you are interested in.

You could loop through columns, or you could use something like vapply to say:

for each column `col` in bt1:
    ct1$col <- ifelse(bt1$col == 0, 0, at1$col)

This can be achieved by:

ct1 <- vapply(colnames(bt1), function (col) {
           ifelse(bt1[[col]] == 0, 0, at1[[col]])
        }, FUN.VALUE=at1$Sample1)

See ?vapply, but in brief:

  • the colnames(bt1) means "for each column in bt",
  • the function (col) { ifelse(bt1[[col]] == 0, 0, at1[[col]]) } is the statement in the pseudocode above: set the value eqqual to 0 if bt1 is 0 and set it to the value in at1 otherwise,
  • the FUN.VALUE=at1$Sample1 is because vapply requires an example of what the function will output (in our case, a column of a data frame).
like image 45
mathematical.coffee Avatar answered Oct 16 '22 17:10

mathematical.coffee