
Randomly sampling subsets of dataframe variable

Tags:

r

plyr

I am working on a large dataset of travel behaviour data collected over a weekly period. Over the course of a week, people completed a log of the individual trips they took during that week. Individuals are identified by a unique identification number (ID). What I want to do is randomly choose two days of diary data (which may comprise one or many trips) from the weekly data available for each unique ID, and put this in a new data frame. An example data frame is detailed below:

Df1 <- data.frame(ID = c(1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3), 
                  date = c("1st Nov", "1st Nov", "3rd Nov", "4th Nov","4th Nov","5th Nov","2nd Nov", "2nd Nov", "3nd Nov", "4th Nov","5th Nov","5th Nov","2nd Nov", "2nd Nov", "3nd Nov", "4th Nov","5th Nov"))

Any help on the above would be gratefully received.

Many thanks,

Katie

KT_1 asked Dec 07 '11 11:12


1 Answer

Sounds like a job for plyr. To sample two random days for each ID:

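library(plyr)

# For each ID, keep every row that falls on one of two randomly chosen days.
ddply(Df1, .(ID), function(x) {
  unique_days <- as.character(unique(x$date))
  if (length(unique_days) < 2) {
    # Fewer than two distinct days available: keep whatever there is.
    randomSelDays <- unique_days
  } else {
    randomSelDays <- sample(unique_days, 2)
  }
  x[x$date %in% randomSelDays, ]
})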

This returns every row for the two randomly selected days of each ID. If an ID has only one day of data, that single day is returned. Because the days are sampled at random, the result varies between runs; one possible output is:

  ID    date
1  1 1st Nov
2  1 1st Nov
3  1 3rd Nov
4  2 3nd Nov
5  2 5th Nov
6  2 5th Nov
7  3 2nd Nov
8  3 2nd Nov
9  3 3nd Nov
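
If you would rather not depend on plyr, the same idea can be sketched in base R with split(), lapply() and rbind(). This is only an illustration: the helper name sample_days, its n_days argument, the result name Df2 and the fixed seed are my own assumptions, not part of the question. The seed is there purely so that a run can be repeated.

```r
# Example data from the question (dates kept exactly as posted).
Df1 <- data.frame(
  ID = c(1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3),
  date = c("1st Nov", "1st Nov", "3rd Nov", "4th Nov", "4th Nov", "5th Nov",
           "2nd Nov", "2nd Nov", "3nd Nov", "4th Nov", "5th Nov", "5th Nov",
           "2nd Nov", "2nd Nov", "3nd Nov", "4th Nov", "5th Nov"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep all rows that fall on up to `n_days`
# randomly chosen distinct dates within one ID's data.
sample_days <- function(x, n_days = 2) {
  unique_days <- unique(as.character(x$date))
  # Never ask for more days than exist, so IDs with a single day still work.
  n_pick <- min(n_days, length(unique_days))
  picked <- unique_days[sample.int(length(unique_days), n_pick)]
  x[x$date %in% picked, , drop = FALSE]
}

set.seed(42)  # assumption: fixed seed only to make the draw repeatable

# Split by ID, sample within each piece, then stack the pieces again.
pieces <- lapply(split(Df1, Df1$ID), sample_days)
Df2 <- do.call(rbind, pieces)
rownames(Df2) <- NULL
```

Df2 then holds the sampled trips in a new data frame, as asked for in the question. Drop the set.seed() call if you want a fresh draw on every run.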
Paul Hiemstra answered Nov 15 '22 04:11