Extract dataframe-entries based on date

Question

I have a dataset of the following form:

 V1  V2         V3  V4
999  53 2015-07-02   2
999  53 2011-07-03   3
998  56 2015-03-08   4
998  56 2011-03-18   5
998  58 2014-12-26   6
998  57 2016-05-21   8
998  57 2015-04-12   9
998  58 2013-09-29  10
997  63 2013-09-28  19
997  63 2014-08-21  20

Note that every combination of V1 and V2 appears exactly twice: (999, 53), (998, 56), and so on. Also note that V3 is a date, so the two entries making up a duplicate pair occur at two different times.

I would like to create two dataframes from the above dataset, one with the early entries of the duplicates and one with the old entries. That is, I would like to end up with the following two dataframes

the "old"

999   53 2011-07-03     3
998   56 2011-03-18     5
998   57 2015-04-12     9
998   58 2013-09-29     10
997   63 2013-09-28     19

and "early"

999   53 2015-07-02     2
998   56 2015-03-08     4
998   58 2014-12-26     6
998   57 2016-05-21     8
997   63 2014-08-21     20

I can of course use two for-loops for this, but my data is quite large, so that would be inefficient. Are there other ways to achieve this?

d.b · Accepted Answer

As Jealie pointed out in comments, for these solutions, df would have to be sorted on V3 first.

df = df[order(df$V3),]
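
The sort above works on the DATA below because V3 holds ISO-formatted (YYYY-MM-DD) strings, which happen to sort chronologically as text. If the dates could arrive in another format, converting with as.Date first is probably safer. A minimal sketch, where the three-row df is a made-up subset of the question's data and df_sorted is a name chosen for illustration:

```r
# Hypothetical three-row subset of the question's data
df <- data.frame(
  V1 = c(999L, 999L, 998L),
  V2 = c(53L, 53L, 56L),
  V3 = c("2015-07-02", "2011-07-03", "2015-03-08"),
  V4 = c(2L, 3L, 4L),
  stringsAsFactors = FALSE
)

# Convert the character column to Date so that ordering is chronological
# regardless of how the strings would compare as text
df$V3 <- as.Date(df$V3, format = "%Y-%m-%d")

# Sort ascending by date: the oldest entry comes first
df_sorted <- df[order(df$V3), ]
```

With a different input format, only the format argument of as.Date would need to change.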

You could split into both groups at once

split(df, duplicated(df[,1:2]))

OR use duplicated with V1 and V2 to subset each group separately

df[!duplicated(df[,1:2]),]
df[duplicated(df[,1:2]),]

OR use ave to determine whether each (V1, V2) pair is appearing for the first or the second time, and subset directly.

df[ave(seq_along(df$V1), paste(df$V1, df$V2, sep = "-"), FUN = seq_along) == 1,]
df[ave(seq_along(df$V1), paste(df$V1, df$V2, sep = "-"), FUN = seq_along) == 2,]
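
To keep both results around as named dataframes, the split approach can be sketched end to end as follows. The names is_dup, parts, old_df and new_df are chosen here for illustration; the data is the same as in the DATA section, rebuilt with data.frame so the block runs on its own.

```r
# Same values as the DATA section, rebuilt so this block is self-contained
df <- data.frame(
  V1 = c(999L, 999L, 998L, 998L, 998L, 998L, 998L, 998L, 997L, 997L),
  V2 = c(53L, 53L, 56L, 56L, 58L, 57L, 57L, 58L, 63L, 63L),
  V3 = c("2015-07-02", "2011-07-03", "2015-03-08", "2011-03-18",
         "2014-12-26", "2016-05-21", "2015-04-12", "2013-09-29",
         "2013-09-28", "2014-08-21"),
  V4 = c(2L, 3L, 4L, 5L, 6L, 8L, 9L, 10L, 19L, 20L),
  stringsAsFactors = FALSE
)

# Sort by date so the older row of each pair comes first
df <- df[order(df$V3), ]

# TRUE marks the second (later) occurrence of each (V1, V2) pair
is_dup <- duplicated(df[, c("V1", "V2")])

# split() returns a list whose elements are named "FALSE" and "TRUE"
parts <- split(df, is_dup)
old_df <- parts[["FALSE"]]  # first occurrence: the older date
new_df <- parts[["TRUE"]]   # second occurrence: the more recent date
```

Selecting the key columns by name rather than by position (1:2) is a small design choice that keeps the code working if the column order changes.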

DATA

df = structure(list(V1 = c(999L, 999L, 998L, 998L, 998L, 998L, 998L, 
998L, 997L, 997L), V2 = c(53L, 53L, 56L, 56L, 58L, 57L, 57L, 
58L, 63L, 63L), V3 = c("2015-07-02", "2011-07-03", "2015-03-08", 
"2011-03-18", "2014-12-26", "2016-05-21", "2015-04-12", "2013-09-29", 
"2013-09-28", "2014-08-21"), V4 = c(2L, 3L, 4L, 5L, 6L, 8L, 9L, 
10L, 19L, 20L)), .Names = c("V1", "V2", "V3", "V4"), class = "data.frame",
row.names = c(NA, -10L))

lmo · Answer

As long as you only have pairs, this will work.

# get the positions of the rows sorted by V2 and then V3
myOrd <- with(df, order(V2, V3))

# Keep the first observation of each pair (the earlier date; the question's "old" set)
df[myOrd[c(TRUE, FALSE)],]
   V1 V2         V3 V4
2 999 53 2011-07-03  3
4 998 56 2011-03-18  5
7 998 57 2015-04-12  9
8 998 58 2013-09-29 10
9 997 63 2013-09-28 19

# Keep the second observation of each pair (the later date; the question's "early" set)
df[myOrd[c(FALSE, TRUE)],]
    V1 V2         V3 V4
1  999 53 2015-07-02  2
3  998 56 2015-03-08  4
6  998 57 2016-05-21  8
5  998 58 2014-12-26  6
10 997 63 2014-08-21 20

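Here, order is used to find the row positions in sorted order. Because every pair then occupies two consecutive positions, the recycled logical vectors c(TRUE, FALSE) and c(FALSE, TRUE) pick out the odd and even positions, that is, the first and second row of each pair. Ordering by V2 alone suffices for this data because each V2 value belongs to a single V1; otherwise, include V1 in the call to order.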
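
Since the alternating-index trick silently returns wrong rows when a key does not occur exactly twice, it may be worth checking that assumption up front. A hedged sketch, which also orders by V1 in case V2 alone does not identify a pair; key, ord, first_of_pair and second_of_pair are illustrative names, and the data repeats the question's values:

```r
# Same values as the question's data, rebuilt so this block is self-contained
df <- data.frame(
  V1 = c(999L, 999L, 998L, 998L, 998L, 998L, 998L, 998L, 997L, 997L),
  V2 = c(53L, 53L, 56L, 56L, 58L, 57L, 57L, 58L, 63L, 63L),
  V3 = c("2015-07-02", "2011-07-03", "2015-03-08", "2011-03-18",
         "2014-12-26", "2016-05-21", "2015-04-12", "2013-09-29",
         "2013-09-28", "2014-08-21"),
  V4 = c(2L, 3L, 4L, 5L, 6L, 8L, 9L, 10L, 19L, 20L),
  stringsAsFactors = FALSE
)

# Build one key per (V1, V2) combination and stop unless each occurs twice
key <- paste(df$V1, df$V2, sep = "-")
stopifnot(all(table(key) == 2))

# Order by the full key and then by date, so each pair sits in two
# consecutive positions with the earlier date first
ord <- with(df, order(V1, V2, V3))

first_of_pair  <- df[ord[c(TRUE, FALSE)], ]  # odd positions: earlier dates
second_of_pair <- df[ord[c(FALSE, TRUE)], ]  # even positions: later dates
```

Because both results come from the same ordering, row i of first_of_pair and row i of second_of_pair belong to the same (V1, V2) pair.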

Tags:

r

BillyJean
