Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Using ifelse with transform in ddply

Tags:

r

plyr

lubridate

I am trying to use ddply with transform to populate a new variable (summary_Date) in a dataframe with variables ID and Date. The value of the variable is chosen based on the length of the piece that is being evaluated using ifelse:

If there are less than five observations for an ID in a given month, I want to have summary_Date be calculated by rounding the date to the nearest month (using round_date from package lubridate); if there are more than five observations for an ID in a given month, I want the summary_Date to simply be Date.

require(plyr)
require(lubridate)

test.df <- structure(
  list(ID = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,1, 1, 1, 1, 1, 1, 1, 1, 1
                , 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,2, 2, 2, 2, 2, 2, 2, 2)
       , Date = structure(c(-247320000, -246196800, -245073600, -243864000
                            , -242654400, -241444800, -126273600, -123595200
                            , -121176000, -118497600, 1359385200, 1359388800
                            , 1359392400, 1359396000, 1359399600, 1359403200
                            , 1359406800, 1359410400, 1359414000, 1359417600
                            , 55598400, 56116800, 58881600, 62078400, 64756800
                            , 67348800, 69854400, 72964800, 76161600, 79012800
                            , 1358589600, 1358676000, 1358762400, 1358848800
                            , 1358935200, 1359021600, 1359108000, 1359194400
                            , 1359280800, 1359367200), tzone = "GMT"
                          , class = c("POSIXct", "POSIXt"))
       , Val=rnorm(40))
  , .Names = c("ID", "Date", "Val"), row.names = c(NA, 40L)
  , class = "data.frame")

test.df <- ddply(test.df, .(ID, floor_date(Date, "month")), transform
                 , summary_Date=as.POSIXct(ifelse(length(ID)<5
                                                  , round_date(Date, "month")
                                                  ,Date)
                                           , origin="1970-01-01 00:00.00"
                                           , tz="GMT")
                 # Included length_x to easily see the length of the subset
                 , length_x = length(ID))

head(test.df,5)
#   floor_date(Date, "month") ID                Date        Val summary_Date length_x
# 1                1962-03-01  1 1962-03-01 12:00:00 -0.1037988   1962-03-01        3
# 2                1962-03-01  1 1962-03-14 12:00:00  0.2923056   1962-03-01        3
# 3                1962-03-01  1 1962-03-27 12:00:00  0.4435410   1962-03-01        3
# 4                1962-04-01  1 1962-04-10 12:00:00  0.1159164   1962-04-01        2
# 5                1962-04-01  1 1962-04-24 12:00:00  2.9824075   1962-04-01        2

The ifelse statement seems to be working, but the value in 'summary_Date' seems to be the first value calculated for the subset that transform is working on, rather than the row-specific value. For example in row 3, summary_Date should be 1962-04-01 because the date 1962-03-27 12:00:00' should be rounded up (because there are fewer than five rows in the subset), but instead the first calculated value of summary_Date (1962-03-01) is repeated in all rows in that subset.

EDIT: I was inspired by Ricardo's answer using data.table to try it in two steps with ddply. It works also:

test.df <- ddply(test.df, .(ID, floor_date(Date, "month")), transform
                 , length_x = length(ID))

test.df <- ddply(test.df, .(ID, floor_date(Date, "month")), transform
                 , summary_Date=as.POSIXct(ifelse(length_x<5
                                                  , round_date(Date, "month")
                                                  ,Date)
                                           , origin="1970-01-01 00:00.00"
                                           , tz="GMT"))

head(test.df,5)[c(1,3:7)]
#   floor_date(Date, "month") ID                Date        Val length_x summary_Date
# 1                1962-03-01  1 1962-03-01 12:00:00 -0.1711212        3   1962-03-01
# 2                1962-03-01  1 1962-03-14 12:00:00 -0.1531571        3   1962-03-01
# 3                1962-03-01  1 1962-03-27 12:00:00  0.1256238        3   1962-04-01
# 4                1962-04-01  1 1962-04-10 12:00:00  1.4481225        2   1962-04-01
# 5                1962-04-01  1 1962-04-24 12:00:00 -0.6508731        2   1962-05-01
like image 898
andyteucher Avatar asked Mar 01 '13 18:03

andyteucher


2 Answers

One Step ddply solution (also posted as comment)

ddply(test.df, .(ID, floor_date(Date, "month")), mutate, 
  length_x = length(ID), 
  summary_Date=as.POSIXct(ifelse(length_x < 5, round_date(Date, "month") ,Date)
    , origin="1970-01-01 00:00.00", tz="GMT")
)
like image 160
Ramnath Avatar answered Sep 19 '22 03:09

Ramnath


# transform to data.table
library(data.table)
test.dt <- data.table(test.df)

# calculate length of id by month-year. 
test.dt[, idlen := length(ID),  by=list(month(Date), year(Date)) ]

# calculate the summary date
test.dt[, summary_Date := ifelse(idlen<5, as.Date(round_date(Date, "month")), as.Date(Date))]

# If you would like to have it formatted add the following: 
test.dt[, summary_Date := as.Date(summary_Date, origin="1970-01-01")]

Results:

 > test.dt
    ID                Date         Val idlen summary_Date
 1:  1 1962-03-01 12:00:00  0.42646422     3   1962-03-01
 2:  1 1962-03-14 12:00:00 -0.29507148     3   1962-03-01
 3:  1 1962-03-27 12:00:00  0.89512566     3   1962-04-01   <~~~~~
 4:  1 1962-04-10 12:00:00  0.87813349     2   1962-04-01
 5:  1 1962-04-24 12:00:00  0.82158108     2   1962-05-01
 6:  1 1962-05-08 12:00:00  0.68864025     1   1962-05-01


UPDATE:

Explanation of why two steps are needed

The reason it cannot be done in one step has to do with the fact that you are only getting a single value per group. When you assign that value to the members of the group, you are assigning 1 element to many. R knows how to handle such situations very well: recycling the single element.

However, in this specifica case, you do not want to recycle; Rather, you do not want to apply the 1 element to many. Therefore, you need unique groups, which is what we do in the second step. Each element (row) of the group then gets assigned its own, specific value.

UPDATE 2:

@Ramnath gave a great suggestion of using mutate. Taking a look at ?mutate, it gives:

This function is very similar to transform but it executes the transformations iteratively ... later transformations can use the columns created by earlier transformations

Which is exactly what you want to do!

like image 25
Ricardo Saporta Avatar answered Sep 19 '22 03:09

Ricardo Saporta