
R - Filter Data from a data frame

I am new to R and unsure how to filter data in a data frame.

I have created a data frame with two columns: a monthly date and the corresponding temperature. It has 324 rows.

> head(Nino3.4_1974_2000)
  Month_common               Nino3.4_degree_1974_2000_plain
1   1974-01-15                       -1.93025
2   1974-02-15                       -1.73535
3   1974-03-15                       -1.20040
4   1974-04-15                       -1.00390
5   1974-05-15                       -0.62550
6   1974-06-15                       -0.36915

The filter rule is to select the temperatures that are greater than or equal to 0.5 degrees. In addition, they have to stay at or above that level for at least 5 consecutive months.

I have eliminated the rows with temperatures below 0.5 degrees (see below).

# no loop is needed: the comparison is already vectorised over all rows
el_nino <- Nino3.4_1974_2000[which(Nino3.4_1974_2000$Nino3.4_degree_1974_2000_plain >= 0.5), ]

> head(el_nino)
   Month_common               Nino3.4_degree_1974_2000_plain
32   1976-08-15                      0.5192000
33   1976-09-15                      0.8740000
34   1976-10-15                      0.8864501
35   1976-11-15                      0.8229501
36   1976-12-15                      0.7336500
37   1977-01-15                      0.9276500

However, I still need to extract the runs of at least 5 consecutive months. I hope someone can help me out.

Yu Deng asked Jan 18 '12 05:01


2 Answers

If you can always rely on the spacing being one month, then let's temporarily discard the time information:

temps <- Nino3.4_1974_2000$Nino3.4_degree_1974_2000_plain

So, since consecutive temperatures in that vector are always one month apart, we just have to look for runs where temps[i] >= 0.5 and the run is at least 5 long.
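
The one-month spacing is an assumption, so it may be worth checking before discarding the dates. A minimal sketch, assuming the date column is (or can be converted with as.Date to) a Date vector; `months` here is a made-up stand-in for Nino3.4_1974_2000$Month_common:

```r
# Hypothetical stand-in for Nino3.4_1974_2000$Month_common (mid-month dates)
months <- seq(as.Date("1974-01-15"), by = "month", length.out = 324)

# Convert each date to a running month count (year * 12 + month) ...
month_index <- as.integer(format(months, "%Y")) * 12L +
  as.integer(format(months, "%m"))

# ... so consecutive rows should differ by exactly 1
regular <- all(diff(month_index) == 1L)
regular
```

If `regular` is FALSE there is a gap or a duplicate somewhere, and counting rows would no longer be the same as counting months.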

If we do the following:

ofinterest <- temps >= 0.5

we'll have a vector ofinterest with values TRUE FALSE FALSE TRUE TRUE .... etc where it's TRUE when temps[i] was >= 0.5 and FALSE otherwise.

To rephrase your problem then, we just need to look for occurrences of at least five TRUE in a row.

To do this we can use the function rle. ?rle gives:

> ?rle
Description
     Compute the lengths and values of runs of equal values in a vector
     - or the reverse operation.
Value:
     ‘rle()’ returns an object of class ‘"rle"’ which is a list with
     components:    
 lengths: an integer vector containing the length of each run.
  values: a vector of the same length as ‘lengths’ with the
          corresponding values.

So we use rle which counts up all the streaks of consecutive TRUE in a row and consecutive FALSE in a row, and look for at least 5 TRUE in a row.
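
To see what rle returns, here is a small sketch on a made-up vector (the values in `temps_demo` are invented purely for illustration):

```r
# Made-up temperatures, for illustration only
temps_demo <- c(0.7, 0.2, 0.1, 0.6, 0.9, 0.8, 0.5, 1.1, 0.3)

runs_demo <- rle(temps_demo >= 0.5)
runs_demo$lengths           # 1 2 5 1
runs_demo$values            # TRUE FALSE TRUE FALSE

# the cumulative sum of the lengths is the last index of each run
cumsum(runs_demo$lengths)   # 1 3 8 9
```

The third run is the only one that is both TRUE and at least 5 long, and it covers elements 4 to 8 of the vector.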

I'll just make up some data to demonstrate:

# for you, temps <- Nino3.4_1974_2000$Nino3.4_degree_1974_2000_plain
temps <- runif(1000) 

# make a vector that is TRUE when temperature is >= 0.5 and FALSE otherwise
ofinterest <- temps >= 0.5

# count up the runs of TRUEs and FALSEs using rle:
runs <- rle(ofinterest) 

# we need to find runs where runs$lengths >= 5 (i.e. at least 5 in a row),
# AND runs$values is TRUE (so at least 5 'TRUE's in a row).
streakIs <- which(runs$lengths >= 5 & runs$values)

# these are all the el_nino occurrences.
# We need to convert `streakIs` into indices into our original `temps` vector.
# To do this we add up all the `runs$lengths` before run `streakIs[i]` and add 1;
# that gives the index into `temps` where the streak starts.
# that is:
# startMonths <- c()
# for ( n in streakIs ) {
#     startMonths <- c(startMonths, sum(runs$lengths[seq_len(n - 1)]) + 1)
# }
#
# (seq_len(n - 1) rather than 1:(n-1), because 1:0 is c(1, 0) and would give the
# wrong start when the very first run is a streak.)
# However, since this is R we can vectorise with sapply:
startMonths <- sapply(streakIs, function(n) sum(runs$lengths[seq_len(n - 1)]) + 1)

Now if you do Nino3.4_1974_2000$Month_common[startMonths] you'll get all the months in which the El Nino started.
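
If you want every row belonging to a qualifying streak rather than only the start months, one option is to expand the per-run flag back to one flag per row with rep. This is a sketch on made-up data; `demo`, `Month` and `Temp` are hypothetical stand-ins for the real data frame and its columns:

```r
# Made-up data frame standing in for Nino3.4_1974_2000 (column names shortened)
demo <- data.frame(
  Month = seq(as.Date("1974-01-15"), by = "month", length.out = 12),
  Temp  = c(0.6, 0.7, 0.1, 0.5, 0.9, 0.8, 0.7, 0.6, 0.2, 0.9, 0.9, 0.1)
)

runs <- rle(demo$Temp >= 0.5)

# TRUE for runs that are warm AND at least 5 months long
is_streak <- runs$values & runs$lengths >= 5

# expand the per-run flag back to one flag per row
keep <- rep(is_streak, times = runs$lengths)

# label each row with the number of the run it belongs to,
# so separate streaks can still be told apart after filtering
demo$run_id <- rep(seq_along(runs$lengths), times = runs$lengths)

el_nino_months <- demo[keep, ]
el_nino_months
```

Here only rows 4 to 8 survive: the two-month warm spells at the start and near the end are dropped because they are shorter than 5 months.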

It boils down to just a few lines:

runs <- rle(Nino3.4_1974_2000$Nino3.4_degree_1974_2000_plain>=0.5) 
streakIs <- which(runs$lengths>=5 & runs$values)
startMonths <- sapply(streakIs, function(n) sum(runs$lengths[seq_len(n - 1)]) + 1)
Nino3.4_1974_2000$Month_common[startMonths]

mathematical.coffee answered Sep 22 '22 04:09


Here's one way, using the fact that the months are regular and always one month apart. Then the problem reduces to finding at least 5 consecutive rows with temps >= 0.5 degrees:

# Some sample data
d <- data.frame(Month=1:20, Temp=c(rep(1,6),0,rep(1,4),0,rep(1,5),0, rep(1,2)))
d

# Use rle to find runs of temps >= 0.5 degrees
x <- rle(d$Temp >= 0.5)

# Then find the last row in each run of 5 or more
y <- x$lengths>=5 # BUG HERE: See update below!
lastRow <- cumsum(x$lengths)[y]

# Finally, deduce the first row and make a result matrix
firstRow <- lastRow - x$lengths[y] + 1L
res <- cbind(firstRow, lastRow) 
res
#     firstRow lastRow
#[1,]        1       6
#[2,]       13      17

UPDATE: I had a bug that also detected runs of 5 or more values below 0.5. Here's the updated code (and test data):

d <- data.frame(Month=1:20, Temp=c(rep(0,6),1,0,rep(1,4),0,rep(1,5),0, 1))
x <- rle(d$Temp >= 0.5)
y <- x$lengths>=5 & x$values
lastRow <- cumsum(x$lengths)[y]
firstRow <- lastRow - x$lengths[y] + 1L
res <- cbind(firstRow, lastRow) 
res
#     firstRow lastRow
#[1,]       14      18
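
The same steps could be wrapped in a small function so that the threshold and minimum run length are not hard-coded. This is only a sketch: `find_streaks` and its arguments are hypothetical names, and it assumes the temperature vector contains no NA values.

```r
# Hypothetical helper: first and last row of each run of at least `min_len`
# consecutive values >= `threshold` (assumes x has no NAs)
find_streaks <- function(x, threshold = 0.5, min_len = 5L) {
  r <- rle(x >= threshold)
  hit <- r$values & r$lengths >= min_len
  lastRow <- cumsum(r$lengths)[hit]
  firstRow <- lastRow - r$lengths[hit] + 1L
  data.frame(firstRow = firstRow, lastRow = lastRow, length = r$lengths[hit])
}

d <- data.frame(Month = 1:20,
                Temp = c(rep(0, 6), 1, 0, rep(1, 4), 0, rep(1, 5), 0, 1))
s <- find_streaks(d$Temp)
s

# look up the months at the start and end of each streak
data.frame(start = d$Month[s$firstRow], end = d$Month[s$lastRow])
```

With the real data, d$Month would be replaced by the date column, which gives the start and end date of each streak.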

Tommy answered Sep 23 '22 04:09