
How to search for equal variables in rows (in a smart way) and store according rows as subsets?

Tags:

r

I have a huge data frame. One column is an integer that takes the values 1 or 2. I need a way to find runs of consecutive rows in which this column holds a given value at least a certain number of times, extract those rows as subsets, and later process them into graphs.

I attached a small example, which does at least some of the desired work: I am able to print out the subsets I am looking for. But two questions remain:

  • I guess there are much smarter methods in R than applying a "for" loop over the complete data.frame. Any hints?
  • Which command do I have to put where the "print" command is now in order to store the temporary data.frame? I guess I need a list because the subsets differ in length...

I already had a look at aggregate or ddply, but could not come up with a solution.

Any help is highly appreciated.

test<-c(rep(1,3),rep(2,5),rep(1,3),rep(2,3),rep(1,3),rep(2,8),rep(1,3)) 
letters<-c("a","b","c","d")
a1<-as.data.frame(cbind(test,letters))

BZ<-2   #The variable to look for
n_BZ=4  #The minimum number of appearances

k<-1  # A variable to be used as a list item index in which the subset will be stored

for (i in 2:nrow(a1)){
  if (a1$test[i-1]!=BZ & a1$test[i]==BZ)      # When "test" BECOMES "2"
    {t_temp<-a1[i,]}                            #... start writing a temporary array
  else if (a1$test[i-1]==BZ & a1$test[i]==BZ) # When "test" REMAINS "2"
    {t_temp<-rbind(t_temp,a1[i,])}              #... continue writing a temporary array 
  else if (a1$test[i-1]==BZ & a1$test[i]!=BZ) # When "test" ENDS BEING "2"
    {if (nrow(t_temp)>n_BZ)                     #... check if the temporary array has more rows than demanded
      {print(t_temp)                              #... print the array (desired: put the array to a list item k)
       k<-k+1}}                                   #... increase k
    else                                      # If array too small
    {t_temp<-NULL}                              # reset
}
Jochen Döll asked Oct 24 '12 13:10


1 Answer

The rle function is really handy for stuff like this. It takes an atomic vector and returns a list with elements lengths and values, where lengths contains the run length of each value in values.
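To make the shape of that return value concrete, here is a minimal sketch on a toy vector (the names x and runs are made up for illustration and are not part of the question's data):

```r
# Toy vector mirroring the first three runs of the question's `test` column
x <- c(1, 1, 1, 2, 2, 2, 2, 2, 1, 1, 1)

runs <- rle(x)
runs$lengths  # 3 5 3  (how long each run is)
runs$values   # 1 2 1  (which value each run holds)

# inverse.rle() rebuilds the original vector from the run encoding
identical(inverse.rle(runs), x)
```

Each position in lengths lines up with the same position in values, so a condition such as "value is 2 and run is at least 4 long" becomes a simple vectorised comparison on those two elements.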

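Since the call to cbind in your example coerces both columns to character (as.data.frame then turns test into a factor in R versions before 4.0 and leaves it as character from 4.0 on), I first converted it to numeric: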

a1 <- within(a1, test <- as.numeric(as.character(test)))
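As an aside, the conversion can be avoided altogether by not going through cbind. The following is only a sketch on shortened data; the names lett, m and a2 are hypothetical and chosen so that base R's built-in letters is not masked:

```r
test <- c(rep(1, 3), rep(2, 5), rep(1, 3))
# Hypothetical helper column; recycled explicitly to the length of `test`
lett <- rep(c("a", "b", "c", "d"), length.out = length(test))

# cbind() on a numeric and a character vector yields a character matrix
m <- cbind(test, lett)
is.character(m)      # TRUE

# data.frame() keeps each column's own type, so `test` stays numeric
a2 <- data.frame(test = test, lett = lett)
is.numeric(a2$test)  # TRUE
```

With a data frame built this way, rle can be applied to the test column directly.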

Then the result can be obtained in a nice (essentially) one-liner:

with(rle(a1$test),
    split(a1, rep(seq_along(lengths), lengths))[values == BZ & lengths >= n_BZ]
)
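For readers who prefer to see the intermediate steps, the same idea can be unrolled as below. This is a sketch that rebuilds the question's data with data.frame (an assumption made here so that test is numeric from the start); runs, run_id, pieces and subsets are illustrative names, not part of the original code:

```r
# Data from the question, built with data.frame() so `test` stays numeric
test <- c(rep(1, 3), rep(2, 5), rep(1, 3), rep(2, 3), rep(1, 3), rep(2, 8), rep(1, 3))
a1 <- data.frame(test = test,
                 letters = rep(c("a", "b", "c", "d"), length.out = length(test)))

BZ   <- 2   # value to look for
n_BZ <- 4   # minimum run length

runs <- rle(a1$test)

# One group id per row: 1 for the first run, 2 for the second, and so on
run_id <- rep(seq_along(runs$lengths), runs$lengths)

# Split into one data frame per run, then keep only the qualifying runs
pieces  <- split(a1, run_id)
subsets <- pieces[runs$values == BZ & runs$lengths >= n_BZ]

length(subsets)        # 2
sapply(subsets, nrow)  # runs "2" and "6", with 5 and 8 rows
```

The result is an ordinary list of data frames, so subsets of different lengths are not a problem, and each one can be handed to a plotting function later with lapply or a loop over the list.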
Matthew Plourde answered Oct 06 '22 23:10