I have a dataframe as follows:
chr leftPos Sample1 X.DD 3_samples MyStuff
1 324 -1 1 1 1
1 4565 -1 0 0 0
1 6887 -1 1 0 0
1 12098 1 -1 1 1
2 12 -1 1 0 1
2 43 -1 1 1 1
5 1 -1 1 1 0
5 43 0 1 -1 0
5 6554 1 1 1 1
5 7654 -1 0 0 0
5 8765 1 1 1 0
5 9833 1 1 1 -1
6 12 1 1 0 0
6 43 0 0 0 0
6 56 1 0 0 0
6 79 1 0 -1 0
6 767 1 0 -1 0
6 3233 1 0 -1 0
I would like to convert it according to the following rules For each chromosome:
a. If there are three or more 1's or -1's consecutively in a column then the value stays as it is.
b. If there are less than three 1's or -1s consecutively in a column then the value of the 1 or -1 changes to 0
The rows in a column have to have the same sign (+ or -ve) to be called consecutive.
The result of the dataframe above should be:
chr leftPos Sample1 X.DD 3_samples MyStuff
1 324 -1 0 0 0
1 4565 -1 0 0 0
1 6887 -1 0 0 0
1 12098 0 0 0 0
2 12 0 0 0 0
2 43 0 0 0 0
5 1 0 1 0 0
5 43 0 1 0 0
5 6554 0 1 0 0
5 7654 0 0 0 0
5 8765 0 0 0 0
5 9833 0 0 0 0
6 12 0 0 0 0
6 43 0 0 0 0
6 56 1 0 0 0
6 79 1 0 -1 0
6 767 1 0 -1 0
6 3233 1 0 -1 0
I have managed to do this for two consecutive rows but I'm not sure how to change this for three or more rows.
DAT_list2res <-cbind(DAT_list2[1:2],DAT_list2res)
colnames(DAT_list2res)[1:2]<-c("chr","leftPos")
DAT_list2res$chr<-as.numeric(gsub("chr","",DAT_list2res$chr))
DAT_list2res<-as.data.frame(DAT_list2res)
dx<-DAT_list2res
f0 <- function( colNr, dx)
{
col <- dx[,colNr]
n1 <- which(col == 1| col == -1) # The `1`-rows.
d0 <- which( diff(col) == 0) # Consecutive rows in a column are equal.
dc0 <- which( diff(dx[,1]) == 0) # Same chromosome.
m <- intersect( n1-1, intersect( d0, dc0 ) )
return ( setdiff( 1:nrow(dx), union(m,m+1) ) )
}
g <- function( dx )
{
for ( i in 3:ncol(dx) ) { dx[f0(i,dx),i] <- 0 }
return ( dx )
}
dx<-g(dx)
Here is one solution only using base R
.
First define a function that will replace any repetitions which are less than 3 for zeros:
replace_f <- function(x){
subs <- rle(x)
subs$values[subs$lengths < 3] <- 0
inverse.rle(subs)
}
Then split your data.frame
by chr
and then apply the function to all columns that you want to change (in this case columns 3 to 6):
df[,3:6] <- do.call("rbind", lapply(split(df[,3:6], df$chr), function(x) apply(x, 2, replace_f)))
Notice that we combine the results together with rbind
before replacing the original data. This will give you the desired result:
chr leftPos Sample1 X.DD X3_samples MyStuff
1 1 324 -1 0 0 0
2 1 4565 -1 0 0 0
3 1 6887 -1 0 0 0
4 1 12098 0 0 0 0
5 2 12 0 0 0 0
6 2 43 0 0 0 0
7 5 1 0 1 0 0
8 5 43 0 1 0 0
9 5 6554 0 1 0 0
10 5 7654 0 0 0 0
11 5 8765 0 0 0 0
12 5 9833 0 0 0 0
13 6 12 0 0 0 0
14 6 43 0 0 0 0
15 6 56 1 0 0 0
16 6 79 1 0 -1 0
17 6 767 1 0 -1 0
18 6 3233 1 0 -1 0
A data.table
solution using rleid
would be
require(data.table)
setDT(dat)
dat[,Sample1 := Sample1 * as.integer(.N>=3), by=.(chr, rleid(Sample1))]
This used the grouping by rleid(Sample1)
and data.table
's helpful .N
-variable.
Doing it for all columns you could use the eval(parse(text=...))
syntax as follows:
for(i in names(dat)[3:6]){
by_string = paste0("list(chr, rleid(", i, "))")
def_string = paste0(i, "* as.integer(.N>=3)")
dat[,(i) := eval(parse(text=def_string)), by=eval(parse(text=by_string))]
}
So it results in:
> dat[]
chr leftPos Sample1 X.DD X3_samples MyStuff
1: 1 324 -1 0 0 0
2: 1 4565 -1 0 0 0
3: 1 6887 -1 0 0 0
4: 1 12098 0 0 0 0
5: 2 12 0 0 0 0
6: 2 43 0 0 0 0
7: 5 1 0 1 0 0
8: 5 43 0 1 0 0
9: 5 6554 0 1 0 0
10: 5 7654 0 0 0 0
11: 5 8765 0 0 0 0
12: 5 9833 0 0 0 0
13: 6 12 0 0 0 0
14: 6 43 0 0 0 0
15: 6 56 1 0 0 0
16: 6 79 1 0 -1 0
17: 6 767 1 0 -1 0
18: 6 3233 1 0 -1 0
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With