I am attempting to find the number of observations by column in a data frame that meet a certain condition after the max for that column has been encountered.
Here is a highly simplified example:
fake.dat <- data.frame(samp1=c(5,6,7,5,4,5,10,5,6,7),
                       samp2=c(2,3,4,6,7,9,2,3,7,8),
                       samp3=c(2,3,4,11,7,9,2,3,7,8),
                       samp4=c(5,6,7,5,4,12,10,5,6,7))
   samp1 samp2 samp3 samp4
1      5     2     2     5
2      6     3     3     6
3      7     4     4     7
4      5     6    11     5
5      4     7     7     4
6      5     9     9    12
7     10     2     2    10
8      5     3     3     5
9      6     7     7     6
10     7     8     8     7
So, let's say I'm trying to find the number of observations per column that are greater than 5 after excluding all the observations in a column up to and including the row where the maximum for the column occurs.
Expected outcome:
samp1 samp2 samp3 samp4
    2     2     4     3
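To make the expected counts concrete, here is samp1 checked by hand with a few base R calls (using only the data frame defined above):
x <- fake.dat$samp1                       # 5 6 7 5 4 5 10 5 6 7
which.max(x)                              # 7 -- the maximum (10) sits in row 7
x[(which.max(x) + 1):length(x)]           # 5 6 7 -- the observations after the max
sum(x[(which.max(x) + 1):length(x)] > 5)  # 2 -- only the 6 and 7 exceed 5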
I am able to get the answer I want by using nested for loops to exclude the observations I don't want.
max.row <- apply(fake.dat, 2, which.max)   # row index of each column's maximum
newfake.dat <- data.frame()
for (j in 1:length(fake.dat)) {
  for (i in 1:nrow(fake.dat)) {
    # keep only the values that come after the column's maximum; earlier rows stay NA
    if (i > max.row[j]) newfake.dat[i, j] <- fake.dat[i, j]
  }
}
print(newfake.dat)
This creates a new data frame on which I can run a simple apply call.
colcount <- apply(newfake.dat, 2, function(x) sum(x > 5, na.rm = TRUE))
   V1 V2 V3 V4
1  NA NA NA NA
2  NA NA NA NA
3  NA NA NA NA
4  NA NA NA NA
5  NA NA  7 NA
6  NA NA  9 NA
7  NA  2  2 10
8   5  3  3  5
9   6  7  7  6
10  7  8  8  7
V1 V2 V3 V4
 2  2  4  3
This is all well and good for this tiny example dataset, but it is prohibitively slow on anything approaching the size of my real datasets, which are large (2000 x 2000 or larger) and numerous. I tried it with a truncated version of one of my files (fewer columns, but the same number of rows) and it ran for at least 5 hours (I left it going when I left work for the day). Also, I don't really need the new data frame for anything other than being able to run the apply function on it.
Is there any way to do this more efficiently? I tried limiting the rows that the apply function works on by using seq and the row number of the max.
maxrow <- apply(fake.dat, 2, function(x) which.max(x))
print(maxrow)
seq.att <- apply(fake.dat, 2, function(x) {
  sum(x[which(seq(1, nrow(fake.dat)) == (maxrow)):nrow(fake.dat)] > 5, na.rm = TRUE)})
This kicks up four instances of the following warning message:
1: In seq(1, nrow(fake.dat)) == (maxrow) :
longer object length is not a multiple of shorter object length
If I ignore the warning message and look at the output anyway, it doesn't give me the answer I expected:
samp1 samp2 samp3 samp4
    2     3     3     3
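As an aside on where that warning comes from: maxrow has one entry per column (length 4), while the row sequence has length 10, so the == comparison recycles maxrow across the rows instead of matching each column against its own maximum position. A quick length check with the objects defined above shows the mismatch:
length(seq(1, nrow(fake.dat)))  # 10 -- one entry per row
length(maxrow)                  #  4 -- one entry per column, recycled in the comparison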
I also tried using a while loop, which kept cycling, so I stopped it (I misplaced the code I tried for this).
So far the most promising result has come from the nested for loops, but I know it's terribly inefficient and I'm hoping there's a better way. I'm still new to R, and I'm sure I'm tripping up on some syntax somewhere. Thanks in advance for any help you can provide!
Here is a way in dplyr to replicate the same process that you showed with base R:
library(dplyr)
fake.dat %>%
  summarise_each(funs(sum(.[(which.max(.) + 1):n()] > 5,
                          na.rm = TRUE)))
#  samp1 samp2 samp3 samp4
#1     2     2     4     3
If you need it as two steps:
datNA <- fake.dat %>%
  mutate_each(funs(replace(., seq_len(which.max(.)), NA)))
datNA %>%
  summarise_each(funs(sum(. > 5, na.rm = TRUE)))
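Note that summarise_each()/mutate_each() have since been deprecated in dplyr; on dplyr 1.0 or later the same one-liner can be written with across() (a sketch assuming a current dplyr version, same logic as above):
fake.dat %>%
  summarise(across(everything(),
                   ~ sum(.x[(which.max(.x) + 1):length(.x)] > 5, na.rm = TRUE)))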
Here's one approach using data.table:
library(data.table)
##
data <- data.frame(
  samp1=c(5,6,7,5,4,5,10,5,6,7),
  samp2=c(2,3,4,6,7,9,2,3,7,8),
  samp3=c(2,3,4,11,7,9,2,3,7,8),
  samp4=c(5,6,7,5,4,12,10,5,6,7))
##
Dt <- data.table(data)
##
R> Dt[, lapply(.SD, function(x){
     y <- x[(which.max(x) + 1):.N]   # values after the column's maximum
     length(y[y > 5])                # how many of those exceed 5
   })]
   samp1 samp2 samp3 samp4
1:     2     2     4     3
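For comparison, the same column-wise logic can also be written in plain base R with sapply(), which avoids building any intermediate object (a sketch following the approach above, not benchmarked against the data.table version):
sapply(fake.dat, function(x) sum(x[(which.max(x) + 1):length(x)] > 5, na.rm = TRUE))
# samp1 samp2 samp3 samp4
#     2     2     4     3
Note that, like the answers above, this assumes the maximum is not in the last row; if it were, (which.max(x) + 1):length(x) would count backwards.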