I have a huge data frame. One column is an integer that takes the values 1 or 2. What I need is a way to find runs of consecutive rows in which this column holds a certain value at least a given number of times, subset these rows and process them later into graphs.
I attached a small example, which does at least some of the desired work: I am able to print out the subsets I am looking for. But two questions remain:
I already had a look at aggregate or ddply, but could not come up with a solution.
Any help is highly appreciated.
test <- c(rep(1,3), rep(2,5), rep(1,3), rep(2,3), rep(1,3), rep(2,8), rep(1,3))
letters <- c("a","b","c","d")
a1 <- as.data.frame(cbind(test, letters))
BZ <- 2     # The value to look for
n_BZ <- 4   # The minimum number of appearances
k <- 1      # A variable to be used as the list item index in which the subset will be stored
for (i in 2:nrow(a1)) {
  if (a1$test[i-1] != BZ & a1$test[i] == BZ) {          # When "test" BECOMES "2"
    t_temp <- a1[i, ]                                   # ... start writing a temporary array
  } else if (a1$test[i-1] == BZ & a1$test[i] == BZ) {   # When "test" REMAINS "2"
    t_temp <- rbind(t_temp, a1[i, ])                    # ... continue writing the temporary array
  } else if (a1$test[i-1] == BZ & a1$test[i] != BZ) {   # When "test" STOPS BEING "2"
    if (nrow(t_temp) > n_BZ) {   # ... check if the temporary array has more rows than demanded
      print(t_temp)              # ... print the array (desired: put the array into list item k)
      k <- k + 1                 # ... increase k
    }
  } else {                       # Otherwise (neither the previous nor the current row is "2")
    t_temp <- NULL               # ... reset
  }
}
The rle function is really handy for stuff like this. It takes an atomic vector and returns a list with elements lengths and values, where lengths contains the run length of each value in values.
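To make the return value concrete, here is a small sketch that applies rle to the test vector from the question. The variable name runs is only an illustrative choice, not something from the original post.

```r
# The run pattern from the question: alternating blocks of 1s and 2s
test <- c(rep(1, 3), rep(2, 5), rep(1, 3), rep(2, 3), rep(1, 3), rep(2, 8), rep(1, 3))

runs <- rle(test)   # 'runs' is an illustrative name
runs$lengths        # 3 5 3 3 3 8 3
runs$values         # 1 2 1 2 1 2 1

# inverse.rle() rebuilds the original vector from the run-length encoding
identical(inverse.rle(runs), test)   # TRUE
```

Each position in lengths lines up with the same position in values, so the second run here is five 2s and the sixth run is eight 2s.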
Since the call to cbind in your example coerces the test column to character (which as.data.frame then turns into a factor in R versions before 4.0.0), I first converted it to numeric:
a1 <- within(a1, test <- as.numeric(as.character(test)))
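The detour through as.character matters when the column is a factor: calling as.numeric directly on a factor returns the internal level codes rather than the labels. A minimal illustration, using made-up values (10 and 20) chosen so that codes and labels differ:

```r
# Hypothetical factor, not from the question's data
f <- factor(c("10", "20", "10"))

as.numeric(f)                 # 1 2 1     (internal level codes)
as.numeric(as.character(f))   # 10 20 10  (the numbers the labels represent)
```

With the values 1 and 2 from the question the two results would happen to coincide, but converting via as.character is the safe habit.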
Then the result can be obtained in a nice (essentially) one-liner:
with(rle(a1$test),
split(a1, rep(seq_along(lengths), lengths))[values == BZ & lengths >= n_BZ]
)
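For readers who prefer to see the steps separately, the same idea can be unpacked as below. This is only a sketch: the names runs, run_id, keep, subsets and row_ranges are my own, and the data frame is rebuilt with data.frame so that test is numeric from the start.

```r
# Rebuild the example data with a numeric 'test' column
test <- c(rep(1, 3), rep(2, 5), rep(1, 3), rep(2, 3), rep(1, 3), rep(2, 8), rep(1, 3))
a1 <- data.frame(test = test,
                 letters = rep(c("a", "b", "c", "d"), length.out = length(test)))
BZ <- 2      # value to look for
n_BZ <- 4    # minimum run length

runs    <- rle(a1$test)
run_id  <- rep(seq_along(runs$lengths), runs$lengths)   # run number of every row
keep    <- runs$values == BZ & runs$lengths >= n_BZ      # which runs qualify
subsets <- split(a1, run_id)[keep]                       # list of data frames, named by run number

names(subsets)          # "2" "6"
sapply(subsets, nrow)   # 5 8

# Each list item can be processed on its own, e.g. to find the original row range
row_ranges <- lapply(subsets, function(d) range(as.integer(rownames(d))))
row_ranges[["2"]]       # 4 8
row_ranges[["6"]]       # 18 25
```

Because the result is an ordinary list, each subset can later be handed to a plotting function with lapply or a for loop. Note that this uses >= n_BZ, as in the one-liner, whereas the loop in the question uses a strict >.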