I have a dataset of the following form:
V1 V2 V3 V4
999 53 2015-07-02 2
999 53 2011-07-03 3
998 56 2015-03-08 4
998 56 2011-03-18 5
998 58 2014-12-26 6
998 57 2016-05-21 8
998 57 2015-04-12 9
998 58 2013-09-29 10
997 63 2013-09-28 19
997 63 2014-08-21 20
Note that duplicates always appear in columns V1 and V2, e.g. (999, 53), (998, 56) and so on. Also note that V3 is a date, so the two entries making up a duplicate appear at two different times.
I would like to create two dataframes from the above dataset: one with the older entry of each duplicate pair (called "old" below) and one with the more recent entry (called "early" below). I.e., I would like to end up with the following two dataframes
the "old"
999 53 2011-07-03 3
998 56 2011-03-18 5
998 57 2015-04-12 9
998 58 2013-09-29 10
997 63 2013-09-28 19
and "early"
999 53 2015-07-02 2
998 56 2015-03-08 4
998 58 2014-12-26 6
998 57 2016-05-21 8
997 63 2014-08-21 20
I can of course use two for-loops for this, but my data is quite large so it will be inefficient. Are there other ways to achieve this?
As Jealie pointed out in comments, for these solutions, df would have to be sorted on V3 first.
df = df[order(df$V3),]
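You could split in one step. The result is a list whose element named FALSE holds the first (earlier) row of each pair and whose element named TRUE holds the second (later) row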
split(df, duplicated(df[,1:2]))
Or use duplicated on V1 and V2 to subset each part separately (first the earlier rows, then the later ones)
df[!duplicated(df[,1:2]),]
df[duplicated(df[,1:2]),]
Or use ave to number the occurrences of each (V1, V2) pair (1 for the first appearance, 2 for the second) and subset on that number directly.
df[ave(seq_along(df$V1), paste(df$V1, df$V2, sep = "-"), FUN = seq_along) ==1,]
df[ave(seq_along(df$V1), paste(df$V1, df$V2, sep = "-"), FUN = seq_along) == 2,]
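To see what the ave call is doing, here is a minimal sketch on a cut-down set of keys (the vectors and the name `occurrence` are made up for illustration and are not part of the question's data). Within each group defined by the pasted key, seq_along restarts at 1, so the result is a per-pair occurrence counter.

```r
# Hypothetical toy keys: two complete pairs and one unpaired row
V1 <- c(999L, 999L, 998L, 998L, 998L)
V2 <- c(53L, 53L, 56L, 56L, 57L)

# One string key per (V1, V2) combination
key <- paste(V1, V2, sep = "-")

# Within each key group, count the rows in their current order
occurrence <- ave(seq_along(key), key, FUN = seq_along)

# occurrence is 1 2 1 2 1: the first row of each key gets 1, the second gets 2
print(occurrence)
```

Comparing this counter with 1 or 2 gives the logical index used in the two subsetting lines above.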
DATA
df = structure(list(V1 = c(999L, 999L, 998L, 998L, 998L, 998L, 998L,
998L, 997L, 997L), V2 = c(53L, 53L, 56L, 56L, 58L, 57L, 57L,
58L, 63L, 63L), V3 = c("2015-07-02", "2011-07-03", "2015-03-08",
"2011-03-18", "2014-12-26", "2016-05-21", "2015-04-12", "2013-09-29",
"2013-09-28", "2014-08-21"), V4 = c(2L, 3L, 4L, 5L, 6L, 8L, 9L,
10L, 19L, 20L)), .Names = c("V1", "V2", "V3", "V4"), class = "data.frame",
row.names = c(NA, -10L))
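Putting the sort and the duplicated subsetting together on the sample data, a self-contained sketch could look like this. The names `df_sorted`, `is_later`, `old` and `recent` are my own, and converting V3 with as.Date is an optional assumption on my part (ISO-formatted date strings also sort correctly as plain text).

```r
# Sample data from the question
df <- data.frame(
  V1 = c(999L, 999L, 998L, 998L, 998L, 998L, 998L, 998L, 997L, 997L),
  V2 = c(53L, 53L, 56L, 56L, 58L, 57L, 57L, 58L, 63L, 63L),
  V3 = c("2015-07-02", "2011-07-03", "2015-03-08", "2011-03-18", "2014-12-26",
         "2016-05-21", "2015-04-12", "2013-09-29", "2013-09-28", "2014-08-21"),
  V4 = c(2L, 3L, 4L, 5L, 6L, 8L, 9L, 10L, 19L, 20L),
  stringsAsFactors = FALSE
)

# Sort by date so the earlier row of each (V1, V2) pair comes first
df_sorted <- df[order(as.Date(df$V3)), ]

# TRUE for every repeat occurrence of a (V1, V2) combination,
# i.e. the later row of each pair after sorting
is_later <- duplicated(df_sorted[, c("V1", "V2")])

old    <- df_sorted[!is_later, ]  # earlier entry of each pair
recent <- df_sorted[is_later, ]   # later entry of each pair
```

Note that the rows come back in date order rather than in the original row order; the original row names are kept, so `old[order(as.integer(rownames(old))), ]` would restore the original ordering if that matters.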
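As long as you only have pairs, this will work. Note that it sorts on V2 only, so it also relies on V2 alone identifying each pair, as is the case in the sample data.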
# get the positions of the rows sorted by V2 and then V3
myOrd <- with(df, order(V2, V3))
# Keep the first observation of each pair (early)
df[myOrd[c(TRUE, FALSE)],]
V1 V2 V3 V4
2 999 53 2011-07-03 3
4 998 56 2011-03-18 5
7 998 57 2015-04-12 9
8 998 58 2013-09-29 10
9 997 63 2013-09-28 19
# Keep the second observation of each pair (late)
df[myOrd[c(FALSE, TRUE)],]
V1 V2 V3 V4
1 999 53 2015-07-02 2
3 998 56 2015-03-08 4
6 998 57 2016-05-21 8
5 998 58 2014-12-26 6
10 997 63 2014-08-21 20
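Here, order is used to find the positions of the sorted observations, so the two rows of each pair end up next to each other with the earlier date first. The logical vectors c(TRUE, FALSE) and c(FALSE, TRUE) are recycled along myOrd, so they pick every odd and every even position respectively, which extracts the first and the second row of each pair.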
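Because the alternating index silently misaligns if any group does not have exactly two rows, a more defensive variant might check that first and sort on V1 as well. This is only a sketch under my own assumptions: the names `key`, `ord`, `first_of_pair` and `second_of_pair` are made up, and sorting on V1 changes the row order compared with the output shown above.

```r
# Sample data from the question
df <- data.frame(
  V1 = c(999L, 999L, 998L, 998L, 998L, 998L, 998L, 998L, 997L, 997L),
  V2 = c(53L, 53L, 56L, 56L, 58L, 57L, 57L, 58L, 63L, 63L),
  V3 = c("2015-07-02", "2011-07-03", "2015-03-08", "2011-03-18", "2014-12-26",
         "2016-05-21", "2015-04-12", "2013-09-29", "2013-09-28", "2014-08-21"),
  V4 = c(2L, 3L, 4L, 5L, 6L, 8L, 9L, 10L, 19L, 20L),
  stringsAsFactors = FALSE
)

# Guard: the odd/even pattern only lines up if every key occurs exactly twice
key <- paste(df$V1, df$V2, sep = "-")
stopifnot(all(table(key) == 2))

# Sort by both key columns, then by date within each pair
ord <- order(df$V1, df$V2, df$V3)

# The logical index is recycled: odd positions are the first row of each pair,
# even positions are the second
first_of_pair  <- df[ord[c(TRUE, FALSE)], ]
second_of_pair <- df[ord[c(FALSE, TRUE)], ]
```

If the guard fails, the duplicated-based approach is the safer choice, since it does not depend on the group sizes.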