I have a large data.frame where the first three columns contain information about a marker. The remaining columns are numeric and hold the values for that marker in each individual. Each individual has three columns. The dataset looks as follows:
                      marker alleleA alleleB   X818 X818.1 X818.2   X345 X345.1 X345.2   X346 X346.1 X346.2
1   kgp5209280_chr3_21902067       T       A 0.0000 1.0000 0.0000 1.0000 0.0000 0.0000 0.0000 1.0000 0.0000
2 chr3_21902130_21902131_A_T       A       T 0.8626 0.1356 0.0018 0.7676 0.2170 0.0154 0.8626 0.1356 0.0018
3 chr3_21902134_21902135_T_C       T       C 0.6982 0.2854 0.0164 0.5617 0.3749 0.0634 0.6982 0.2854 0.0164
That is, for each marker (row), each individual has three values, one in each column.
I want to create a new data.frame which has all the same rows as the original, but only one column per individual. In that column I want whichever of the individual's three values is greater than 0.8. If no value is greater than 0.8, I want NA. For instance, in the data set above, for the first row I would want the second value for 818 (1.0000) and the first value for 345 (1.0000). In the second row, I want the first value for 818 (0.8626); for 345 none of the values are above 0.8, so I want NA, and so on. The new data set would therefore look like this:
                      marker alleleA alleleB   X818 X345
1   kgp5209280_chr3_21902067       T       A 1.0000    1
2 chr3_21902130_21902131_A_T       A       T 0.8626   NA
I have been trying to use if/else statements, along the lines of if [, 4] > 0.8 then [, 4], else... However, it doesn't seem to give me what I want, and I would also have to loop this command so that it handles the three columns of every individual, not just those of the first one.
Any help would be appreciated! Thanks in advance.
With data.table versions >= 1.9.0:
require(data.table)
require(reshape2)
dt <- as.data.table(df)
# melt data.table
dt.m <- melt(dt, id=c("marker", "alleleA", "alleleB"),
             variable.name="id", value.name="val")
dt.m[, id := gsub("\\.[0-9]+$", "", id)] # replace `.[0-9]` with nothing
# aggregation
dt.m <- dt.m[, list(alleleA = alleleA[1],
                    alleleB = alleleB[1], val = max(val)),
             keyby=list(marker, id)][val <= 0.8, val := NA]
# casting back
dt.c <- dcast.data.table(dt.m, marker + alleleA + alleleB ~ id, value.var = "val")
# marker alleleA alleleB X345 X346 X818
# 1: chr3_21902130_21902131_A_T A T NA 0.8626 0.8626
# 2: chr3_21902134_21902135_T_C T C NA NA NA
# 3: kgp5209280_chr3_21902067 T A 1 1.0000 1.0000
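To see what the gsub() call in the melt step does to the column names, here is a small base-R sketch. The vector names `ids` and `stripped` are only illustrative; the values are column names taken from the question.

```r
# Column names of the per-individual columns, as in the question's data
ids <- c("X818", "X818.1", "X818.2", "X345", "X345.1", "X345.2")

# "\\.[0-9]+$" matches a literal dot followed by digits at the end of the name,
# so the ".1" and ".2" suffixes are removed and "X818" is left unchanged
stripped <- gsub("\\.[0-9]+$", "", ids)
print(stripped)
```

After this step the three columns of one individual share a single id, so grouping by marker and id puts that individual's three values together.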
Solution 1: Probably not the best way, but this is what I could think of at the moment:
mm <- t(apply(df[-(1:3)], 1, function(x) tapply(x, gl(3,3), max)))
mode(mm) <- "numeric"
mm[mm <= 0.8] <- NA # keep only values strictly greater than 0.8
# you can set the column names of mm here if necessary
out <- cbind(df[, 1:3], mm)
# marker alleleA alleleB 1 2 3
# 1 kgp5209280_chr3_21902067 T A 1.0000 1 1.0000
# 2 chr3_21902130_21902131_A_T A T 0.8626 NA 0.8626
# 3 chr3_21902134_21902135_T_C T C NA NA NA
gl(3,3) gives a factor with values 1,1,1,2,2,2,3,3,3 with levels 1,2,3. That is, tapply will take the values x 3 at a time and get their max (first 3, next 3 and the last 3). And apply sends each row one by one.
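For a larger data set the grouping factor does not have to be hard-coded. The following is a minimal sketch of the same idea, assuming that the first three columns are always the marker information and that every individual contributes exactly three consecutive columns. The names `n_ind` and `grp` are introduced here for illustration only, and the data is the question's example.

```r
# Example data from the question
df <- data.frame(
  marker  = c("kgp5209280_chr3_21902067", "chr3_21902130_21902131_A_T",
              "chr3_21902134_21902135_T_C"),
  alleleA = c("T", "A", "T"),
  alleleB = c("A", "T", "C"),
  X818   = c(0.0000, 0.8626, 0.6982),
  X818.1 = c(1.0000, 0.1356, 0.2854),
  X818.2 = c(0.0000, 0.0018, 0.0164),
  X345   = c(1.0000, 0.7676, 0.5617),
  X345.1 = c(0.0000, 0.2170, 0.3749),
  X345.2 = c(0.0000, 0.0154, 0.0634),
  X346   = c(0.0000, 0.8626, 0.6982),
  X346.1 = c(1.0000, 0.1356, 0.2854),
  X346.2 = c(0.0000, 0.0018, 0.0164),
  stringsAsFactors = FALSE
)

# Number of individuals, derived from the column count (assumption: 3 columns each)
n_ind <- (ncol(df) - 3) / 3
grp <- gl(n_ind, 3)  # 1,1,1,2,2,2,... one level per individual

# Row-wise maximum within each individual's three columns
mm <- t(apply(df[-(1:3)], 1, function(x) tapply(x, grp, max)))
mm[mm <= 0.8] <- NA

# Name each result column after the first of the individual's three columns
colnames(mm) <- names(df)[seq(4, ncol(df), by = 3)]
out <- cbind(df[1:3], mm)
print(out)
```

Deriving `n_ind` from `ncol(df)` means the same lines work unchanged when more individuals are added, as long as the three-columns-per-individual layout holds.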
Solution 2: A data.table solution with melt and cast within data.table without using reshape or reshape2:
require(data.table)
dt <- data.table(df)
# melt your data.table to long format
dt.melt <- dt[, list(id = names(.SD), val = unlist(.SD)),
              by=list(marker, alleleA, alleleB)]
# replace `.[0-9]` with nothing
dt.melt[, id := gsub("\\.[0-9]+$", "", id)]
# get max value grouping by marker and id
dt.melt <- dt.melt[, list(alleleA = alleleA[1],
                          alleleB = alleleB[1],
                          val = max(val)),
                   keyby=list(marker, id)][val <= 0.8, val := NA]
# edit by mnel: use setattr(, 'names') to avoid the copy made by `names<-` within `setNames`
dt.cast <- dt.melt[, as.list(setattr(val, 'names', id)),
                   by=list(marker, alleleA, alleleB)]
# marker alleleA alleleB X345 X346 X818
# 1: chr3_21902130_21902131_A_T A T NA 0.8626 0.8626
# 2: chr3_21902134_21902135_T_C T C NA NA NA
# 3: kgp5209280_chr3_21902067 T A 1 1.0000 1.0000
I think it is better here to put your data in the long format. Here is a solution based on the reshape2 package, maybe similar to @Arun's second solution but syntactically different:
library(reshape2)
dat.m <- melt(dat,id.vars=1:3)
dat.m$variable <- gsub('[.].*','',dat.m$variable)
dcast(dat.m,...~variable,fun.aggregate=function(x){
  res <- NA_real_
  if(length(x) > 0 && max(x)> 0.8)
    res <- max(x)
  res
})
marker alleleA alleleB X345 X346 X818
1 chr3_21902130_21902131_A_T A T NA 0.8626 0.8626
2 chr3_21902134_21902135_T_C T C NA NA NA
3 kgp5209280_chr3_21902067 T A 1 1.0000 1.0000
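The aggregation function passed to dcast can also be tried on its own. In the sketch below it is given the hypothetical name `pick_above`, with the cutoff as an argument (neither is part of the answer above). The length check guards against an empty input, which, as far as I know, reshape2 passes to fun.aggregate to work out the default fill value.

```r
# Same logic as the anonymous function above, with the cutoff made explicit
pick_above <- function(x, cutoff = 0.8) {
  res <- NA_real_
  if (length(x) > 0 && max(x) > cutoff) {
    res <- max(x)
  }
  res
}

pick_above(c(0.8626, 0.1356, 0.0018))  # one value above the cutoff: 0.8626
pick_above(c(0.7676, 0.2170, 0.0154))  # nothing above the cutoff: NA
pick_above(numeric(0))                 # empty input: NA
```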