I have a table; the first few rows are shown below:
SM_H1455 SM_H1456 SM_H1457 SM_H1461 SM_H1462 SM_H1463
ENSG00000001617.7 0 0 0 0 0 0
ENSG00000001626.9 0 0 0 0 0 0
ENSG00000002587.5 10 0 6 2 0 2
ENSG00000002726.15 8 14 0 2 16 2
ENSG00000002745.8 6 2 2 0 0 4
I want to delete rows in which at least 80% of the columns have the value 0. There are 6 columns here, so any row with a 0 in 5 or more of its columns needs to be deleted.
I currently have this code:
data = data[!rowSums(data == 0), ]
But this code deletes every row that contains a 0, without taking the 80% threshold into account.
I think that @Hong Ooi's answer is incorrect in this case. This will give you the result that you have asked for:
data <- data[rowSums(data==0)/ncol(data) < 0.8, ]
data==0 returns a logical matrix with the same shape as data, holding TRUE wherever the value at that location equals zero and FALSE otherwise. Numerically, R treats TRUE as having a value of 1 and FALSE as having a value of 0.
rowSums adds up the numerical equivalents of the TRUE and FALSE values for each row of the matrix returned from data==0. In other words, rowSums(data==0) gives the number of elements in each row of data that are zero.
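To make the counting step concrete, here is a small sketch that rebuilds the five rows from the question by hand (the variable names is_zero and zero_counts are just my own labels for illustration) and prints the number of zeros per row. It also hints at why the code in the question drops too much: negating the counts with ! is TRUE only for rows that contain no zeros at all.

```r
# Sample rows copied from the question; `data` mirrors the question's variable.
data <- data.frame(
  SM_H1455 = c(0, 0, 10, 8, 6),
  SM_H1456 = c(0, 0, 0, 14, 2),
  SM_H1457 = c(0, 0, 6, 0, 2),
  SM_H1461 = c(0, 0, 2, 2, 0),
  SM_H1462 = c(0, 0, 0, 16, 0),
  SM_H1463 = c(0, 0, 2, 2, 4),
  row.names = c("ENSG00000001617.7", "ENSG00000001626.9",
                "ENSG00000002587.5", "ENSG00000002726.15",
                "ENSG00000002745.8")
)

# Logical matrix: TRUE where a cell equals 0 (illustrative name).
is_zero <- data == 0

# TRUE counts as 1 and FALSE as 0, so this is the number of zeros per row.
zero_counts <- rowSums(is_zero)
print(zero_counts)

# `!` turns any non-zero count into FALSE, so only rows with no zeros
# at all would survive the filter used in the question.
print(!zero_counts)
```

For this sample the counts are 6, 6, 2, 1 and 2, so every row has at least one zero and the filter from the question would keep nothing.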
ncol is the number of columns in the original data object.
rowSums(data==0)/ncol(data) is therefore the proportion of elements equal to zero in each row.
Finally, we discard the rows where the above proportion is not less than 80% by filtering (using [] notation), which keeps only the rows where it is below 0.8.
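Putting the steps together on the sample from the question, a minimal end-to-end sketch could look like the following. Reading the table with read.table(text = ...) is only my assumption about how the data might be loaded, and zero_fraction and filtered are illustrative names; the filtering line itself is the one given above.

```r
# Assumed way of loading the sample: the header has one fewer field than
# the data rows, so read.table uses the first column as row names.
data <- read.table(text = "
SM_H1455 SM_H1456 SM_H1457 SM_H1461 SM_H1462 SM_H1463
ENSG00000001617.7 0 0 0 0 0 0
ENSG00000001626.9 0 0 0 0 0 0
ENSG00000002587.5 10 0 6 2 0 2
ENSG00000002726.15 8 14 0 2 16 2
ENSG00000002745.8 6 2 2 0 0 4
", header = TRUE)

# Proportion of zero entries in each row (illustrative name).
zero_fraction <- rowSums(data == 0) / ncol(data)

# Keep only the rows where fewer than 80% of the entries are zero.
filtered <- data[zero_fraction < 0.8, ]
print(filtered)
```

On this sample the two all-zero rows have a proportion of 1 and are dropped, while the remaining three rows (proportions of 2/6, 1/6 and 2/6) are kept.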
UPDATE: @Hong Ooi's edit means that their answer is also correct now.