I have a 5845*1095 (rows*columns) data frame that looks like this:
9 286593 C C/C C/A A/A
9 334337 A A/A G/A A/A
9 390512 C C/C C/C C/C
c <- c("9", "286593", "C", "C/C", "C/A", "A/A")
d <- c("9", "334337", "A", "A/A", "G/A", "A/A")
e <- c("9", "390512", "C", "C/C", "C/C", "C/C")
dat <- data.frame(rbind(c,d,e))
I want to use the value in the third column to recode the columns to its right. For example, in row 1 column 3 is "C", so column 4 changes from "C/C" to "0" because both letters match. A single matching letter (in either the first or second position) becomes "1", and no matching letter becomes "2".
9 286593 C 0 1 2
9 334337 A 0 1 0
9 390512 C 0 0 0
c <- c("9", "286593", "C", "0", "1", "2")
d <- c("9", "334337", "A", "0", "1", "0")
e <- c("9", "390512", "C", "0", "0", "0")
dat <- data.frame(rbind(c,d,e))
I would like to see the best way to do this, as I want to get out of the habit of using nested for loops in R.
First, your data:
c <- c("9", "286593", "C", "C/C", "C/A", "A/A")
# Note: In your original data there was a space in "G/A", which I removed.
# If it was not a mistake, we would also have to deal with the space.
d <- c("9", "334337", "A", "A/A", "G/A", "A/A")
e <- c("9", "390512", "C", "C/C", "C/C", "C/C")
dat <- data.frame(rbind(c,d,e))
Now we create a vector that holds all the possible letters.
values <- c("A", "C", "G", "T")
# Give the reference column the same factor levels as the lookup table built
# below, so that the two can be compared later on.
dat$X3 <- factor(dat$X3, levels=values)
# Generate all possible combinations of two letters
combinations <- expand.grid(f=values, s=values)
combinations <- cbind(combinations, v=with(combinations, paste(f, s, sep='/')))
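The main function looks up each entry of a column in the combinations table and then compares the first and second letter of the matching combination to the reference value in column 3.
compare <- function(col, val) {
  # Row of the lookup table that corresponds to each entry of the column
  m <- match(col, combinations$v)
  # Start at 2 and subtract 1 for each letter that equals the reference
  2 - (combinations$f[m] == val) - (combinations$s[m] == val)
}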
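To make the lookup step concrete, here is a small sketch that traces what compare does for a single column. It assumes the same values and combinations objects as above; the names first, second and score are only used for illustration, and the inputs are column 5 and column 3 of the example rows.

```r
values <- c("A", "C", "G", "T")
combinations <- expand.grid(f = values, s = values)
combinations$v <- paste(combinations$f, combinations$s, sep = "/")

# Column 5 of the example data and the matching reference letters (column 3)
col <- c("C/A", "G/A", "C/C")
val <- factor(c("C", "A", "C"), levels = values)

m <- match(col, combinations$v)   # row of each genotype in the lookup table
first  <- combinations$f[m]       # first letter of each genotype
second <- combinations$s[m]       # second letter of each genotype

# Each matching letter lowers the score by one
score <- 2 - (first == val) - (second == val)
print(score)                      # 1 1 0
```

Because first, second and val all share the levels in values, the == comparisons between factors are valid.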
Finally, we use apply to run the function on every column that has to be changed. You will probably want to change the 6 to your actual number of columns.
dat[,4:6] <- apply(dat[,4:6], 2, compare, val=dat[,3])
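If you do not want to hard-code the last column, one option is to derive the range from ncol(dat). The following is only a sketch on the three example rows; it assumes that every column from the fourth onwards holds genotypes, and the name cols is made up for the example.

```r
# Example rows from the question
dat <- data.frame(rbind(
  c = c("9", "286593", "C", "C/C", "C/A", "A/A"),
  d = c("9", "334337", "A", "A/A", "G/A", "A/A"),
  e = c("9", "390512", "C", "C/C", "C/C", "C/C")
))

values <- c("A", "C", "G", "T")
dat$X3 <- factor(dat$X3, levels = values)

combinations <- expand.grid(f = values, s = values)
combinations$v <- paste(combinations$f, combinations$s, sep = "/")

compare <- function(col, val) {
  m <- match(col, combinations$v)
  2 - (combinations$f[m] == val) - (combinations$s[m] == val)
}

# Assumption: all columns from the fourth to the last are genotype columns
cols <- 4:ncol(dat)
dat[, cols] <- apply(dat[, cols], 2, compare, val = dat[, 3])
print(dat)
```

On the full data set the same call should then cover columns 4 to 1095 without further changes.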
Note that, unlike the other solutions posted so far, this one does not use string comparison but relies purely on factor levels. It would be interesting to see which approach performs better.
I just did some benchmarking:
  test   replications elapsed relative user.self sys.self user.child sys.child
1 arun        1000000   2.881    1.116     2.864    0.024          0         0
2 fabio       1000000   2.593    1.005     2.558    0.030          0         0
3 roland      1000000   2.727    1.057     2.687    0.048          0         0
5 thilo       1000000   2.581    1.000     2.540    0.036          0         0
4 tyler       1000000   2.663    1.032     2.626    0.042          0         0
This leaves my version slightly faster than the others. However, the difference is close to nothing, so you are probably fine with any of the approaches. And to be fair: I did not benchmark the part where I add the additional factor levels. Including that step would probably rule my version out.
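For anyone who wants to run a rough comparison themselves, here is a minimal sketch using only base R's system.time. It is not the setup that produced the table above: it times just two of the approaches, on the three example rows, with an arbitrary number of runs n. The function names make_dat, lookup_approach and split_approach are made up for this example.

```r
# Example rows from the question
make_dat <- function() {
  data.frame(rbind(
    c = c("9", "286593", "C", "C/C", "C/A", "A/A"),
    d = c("9", "334337", "A", "A/A", "G/A", "A/A"),
    e = c("9", "390512", "C", "C/C", "C/C", "C/C")
  ))
}

values <- c("A", "C", "G", "T")
combinations <- expand.grid(f = values, s = values)
combinations$v <- paste(combinations$f, combinations$s, sep = "/")

# Lookup-table approach (includes the factor conversion of column 3)
lookup_approach <- function(dat) {
  ref <- factor(dat[, 3], levels = values)
  compare <- function(col, val) {
    m <- match(col, combinations$v)
    2 - (combinations$f[m] == val) - (combinations$s[m] == val)
  }
  apply(dat[, 4:6], 2, compare, val = ref)
}

# String-splitting approach, applied row by row
split_approach <- function(dat) {
  FUN <- function(x) {
    a <- strsplit(as.character(unlist(x[-1])), "/")
    b <- sapply(a, function(y) sum(y %in% as.character(unlist(x[1]))))
    2 - b
  }
  t(apply(dat[, 3:6], 1, FUN))
}

dat <- make_dat()
n <- 200  # arbitrary number of runs, chosen only to keep the example quick

timings <- c(
  lookup = system.time(for (i in seq_len(n)) lookup_approach(dat))[["elapsed"]],
  split  = system.time(for (i in seq_len(n)) split_approach(dat))[["elapsed"]]
)
print(timings)
```

Timings on three rows say little about a 5845*1095 data frame, so it is worth repeating the comparison on data of realistic size before drawing conclusions.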
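Here is one approach:
FUN <- function(x) {
  # x[1] is the reference letter, x[-1] are the genotypes of this row
  a <- strsplit(as.character(unlist(x[-1])), "/")
  # Count how many letters of each genotype match the reference
  b <- sapply(a, function(y) sum(y %in% as.character(unlist(x[1]))))
  2 - b
}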
dat[4:6] <- t(apply(dat[, 3:6], 1, FUN))
## > dat
## X1 X2 X3 X4 X5 X6
## c 9 286593 C 0 1 2
## d 9 334337 A 0 1 0
## e 9 390512 C 0 0 0
Here's one way using apply:
out <- apply(dat[, -(1:2)], 1, function(x)
2 - grepl(x[1], x[-1]) -
x[-1] %in% paste(x[1], x[1], sep="/"))
cbind(dat[, (1:3)], t(out))