Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

[R]: applying a function to columns based on conditional row position

I am attempting to find the number of observations by column in a data frame that meet a certain condition after the max for that column has been encountered.

Here is a highly simplified example:

fake.dat<-data.frame(samp1=c(5,6,7,5,4,5,10,5,6,7), samp2=c(2,3,4,6,7,9,2,3,7,8), samp3=c(2,3,4,11,7,9,2,3,7,8),samp4=c(5,6,7,5,4,12,10,5,6,7))

       samp1 samp2 samp3 samp4
1      5     2     2     5
2      6     3     3     6
3      7     4     4     7
4      5     6    11     5
5      4     7     7     4
6      5     9     9    12
7     10     2     2    10
8      5     3     3     5
9      6     7     7     6
10     7     8     8     7

So, let's say I'm trying to find the number of observations per column that are greater than 5 after excluding all the observations in a column up to and including the row where the maximum for the column occurs.

Expected outcome:

samp1 samp2 samp3 samp4 
   2     2     4    3 

I am able to get the answer I want by using nested for loops to exclude the observations I don't want.

newfake.dat<-data.frame()

for(j in 1:length(fake.dat)){
for(i in 1:nrow(fake.dat)){
    ifelse(i>max.row[j],newfake.dat[i,j]<-fake.dat[i,j],"NA")
print(newfake.dat)
}}

This creates a new data frame on which I can run an easy apply function.

colcount<-apply(newfake.dat,2,function(x) (sum(x>5,na.rm=TRUE)))

   V1 V2 V3 V4
1  NA NA NA NA
2  NA NA NA NA
3  NA NA NA NA
4  NA NA NA NA
5  NA NA  7 NA
6  NA NA  9 NA
7  NA  2  2 10
8   5  3  3  5
9   6  7  7  6
10  7  8  8  7

V1 V2 V3 V4 
 2  2  4  3 

Which is all well and good for this tiny example dataset, but is prohibitively slow on anything approaching the size of my real datasets. Which are large (2000 x 2000 or larger) and numerous. I tried it with a truncated version of one of my files (fewer columns, but same number of rows) and it ran for at least 5 hours (I left it going when I left work for the day). Also, I don't really need the new dataframe for anything other than to be able to run the apply function.

Is there any way to do this more efficiently? I tried limiting the rows that the apply function works on by using seq and the row number of the max.

maxrow<-apply(fake.dat,2,function(x) which.max(x))
print(maxrow)

seq.att<-apply(fake.dat,2,function(x) {
    sum(x[which(seq(1,nrow(fake.dat))==(maxrow)):nrow(fake.dat)]>5,na.rm=TRUE)})

Which kicks up four instances of this warning message:

1: In seq(1, nrow(fake.dat)) == (maxrow) :
  longer object length is not a multiple of shorter object length

If I ignore the warning message and get the output anyway it doesn't give me the answer I expected:

samp1 samp2 samp3 samp4 
    2     3     3     3 

I also tried using a while function which kept cycling so I stopped it (I misplaced the code I tried for this).

So far the most promising result has come from the nested for loops, but I know it's terribly inefficient and I'm hoping that there's a better way. I'm still new to R, and I'm sure I'm tripping up on some syntax somewhere. Thanks in advance for any help you can provide!

like image 207
NMc Avatar asked Dec 14 '22 17:12

NMc


2 Answers

Here is a way in dplyr to replicate the same process that you showed with base R

library(dplyr)
fake.dat %>% 
        summarise_each(funs(sum(.[(which.max(.)+1):n()]>5,
                na.rm=TRUE)))
#   samp1 samp2 samp3 samp4
#1     2     2     4     3

If you need it as two steps:

datNA <- fake.dat %>% 
               mutate_each(funs(replace(., seq_len(which.max(.)), NA)))

datNA %>% 
      summarise_each(funs(sum(.>5, na.rm=TRUE)))
like image 190
akrun Avatar answered Apr 25 '23 16:04

akrun


Here's one approach using data.table:

library(data.table)
##
data <- data.frame(
  samp1=c(5,6,7,5,4,5,10,5,6,7), 
  samp2=c(2,3,4,6,7,9,2,3,7,8), 
  samp3=c(2,3,4,11,7,9,2,3,7,8),
  samp4=c(5,6,7,5,4,12,10,5,6,7))
##
Dt <- data.table(data)
##
R> Dt[,lapply(.SD,function(x){
    y <- x[(which.max(x)+1):.N]
    length(y[y>5])
  })
   samp1 samp2 samp3 samp4
1:     2     2     4     3
like image 21
nrussell Avatar answered Apr 25 '23 15:04

nrussell