
Extract dataframe-entries based on date

Tags: r

I have a dataset that has the following form

 V1   V2    V3          V4
999   53 2015-07-02     2
999   53 2011-07-03     3
998   56 2015-03-08     4
998   56 2011-03-18     5
998   58 2014-12-26     6
998   57 2016-05-21     8
998   57 2015-04-12     9
998   58 2013-09-29     10
997   63 2013-09-28     19
997   63 2014-08-21     20

Note that each (V1, V2) combination always appears twice, e.g. (999, 53), (998, 56), and so on. Also note that V3 is a date, so the two entries making up a duplicate were recorded at two different times.

I would like to create two dataframes from the above dataset, one with the early entries of the duplicates and one with the old entries. That is, I would like to end up with the following two dataframes:

the "old" (the earlier date of each pair)

999   53 2011-07-03     3
998   56 2011-03-18     5
998   57 2015-04-12     9
998   58 2013-09-29     10
997   63 2013-09-28     19

and the "early" (the later date of each pair)

999   53 2015-07-02     2
998   56 2015-03-08     4
998   58 2014-12-26     6
998   57 2016-05-21     8
997   63 2014-08-21     20

I could of course use two for-loops for this, but my data is quite large, so that would be inefficient. Are there other ways to achieve this?

BillyJean asked Apr 17 '26 21:04


2 Answers

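As Jealie pointed out in the comments, df has to be sorted on V3 first for these solutions to work.

    df = df[order(df$V3),]

  1. You could split everything at once. The result is a list whose "FALSE" element holds the first (earlier) row of each pair and whose "TRUE" element holds the second (later) row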

    split(df, duplicated(df[,1:2]))

  2. Or use duplicated on V1 and V2 to subset the earlier and the later rows separately

    df[!duplicated(df[,1:2]),]
    df[duplicated(df[,1:2]),]

  3. Or use ave to number the occurrences of each (V1, V2) pair (1 for the first appearance, 2 for the second) and subset on that number directly.

    df[ave(seq_along(df$V1), paste(df$V1, df$V2, sep = "-"), FUN = seq_along) == 1,]
    df[ave(seq_along(df$V1), paste(df$V1, df$V2, sep = "-"), FUN = seq_along) == 2,]

DATA

df = structure(list(V1 = c(999L, 999L, 998L, 998L, 998L, 998L, 998L, 
998L, 997L, 997L), V2 = c(53L, 53L, 56L, 56L, 58L, 57L, 57L, 
58L, 63L, 63L), V3 = c("2015-07-02", "2011-07-03", "2015-03-08", 
"2011-03-18", "2014-12-26", "2016-05-21", "2015-04-12", "2013-09-29", 
"2013-09-28", "2014-08-21"), V4 = c(2L, 3L, 4L, 5L, 6L, 8L, 9L, 
10L, 19L, 20L)), .Names = c("V1", "V2", "V3", "V4"), class = "data.frame",
row.names = c(NA, -10L))
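
As a rough end-to-end sketch of option 2 above, the steps could be combined as follows. The data is copied from the question; the conversion of V3 with as.Date and the names sorted, is_later, older and newer are assumptions added here for illustration, not part of the original answer.

```r
# Sample data copied from the question (V3 stored as character).
df <- data.frame(
  V1 = c(999L, 999L, 998L, 998L, 998L, 998L, 998L, 998L, 997L, 997L),
  V2 = c(53L, 53L, 56L, 56L, 58L, 57L, 57L, 58L, 63L, 63L),
  V3 = c("2015-07-02", "2011-07-03", "2015-03-08", "2011-03-18", "2014-12-26",
         "2016-05-21", "2015-04-12", "2013-09-29", "2013-09-28", "2014-08-21"),
  V4 = c(2L, 3L, 4L, 5L, 6L, 8L, 9L, 10L, 19L, 20L),
  stringsAsFactors = FALSE
)

# Assumption: V3 is always formatted as YYYY-MM-DD, so as.Date can parse it.
df$V3 <- as.Date(df$V3)

# Sort by date so the first row seen for each (V1, V2) key is the earlier one.
sorted <- df[order(df$V3), ]

# TRUE for every repeat occurrence of a (V1, V2) key, i.e. the later row of a pair.
is_later <- duplicated(sorted[, c("V1", "V2")])

older <- sorted[!is_later, ]  # earlier date of each pair
newer <- sorted[is_later, ]   # later date of each pair
```

Selecting the key columns by name rather than by position (1:2) is only a stylistic choice; it keeps the sketch working if the column order changes. If a key could appear more than twice, newer would contain every row after the first one, which may or may not be what is wanted.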
d.b answered Apr 19 '26 12:04


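As long as you only have pairs, and V2 alone identifies each pair (as it does in the sample data), this will work. If the same V2 value can occur under different V1 values, order by V1 as well.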

# get the positions of the rows sorted by V2 and then V3
myOrd <- with(df, order(V2, V3))

# Keep the first observation of each pair (the earlier date)
df[myOrd[c(TRUE, FALSE)],]
   V1 V2         V3 V4
2 999 53 2011-07-03  3
4 998 56 2011-03-18  5
7 998 57 2015-04-12  9
8 998 58 2013-09-29 10
9 997 63 2013-09-28 19

# Keep the second observation of each pair (the later date)
df[myOrd[c(FALSE, TRUE)],]
    V1 V2         V3 V4
1  999 53 2015-07-02  2
3  998 56 2015-03-08  4
6  998 57 2016-05-21  8
5  998 58 2014-12-26  6
10 997 63 2014-08-21 20

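Here, order is used to find the row positions in sorted order, so the two rows of each pair end up next to each other with the earlier date first. The logical vectors c(TRUE, FALSE) and c(FALSE, TRUE) are then recycled along myOrd, picking every odd-numbered or every even-numbered position respectively.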
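
A slightly more defensive variant of the same idea is sketched below. It is not the answer's own code: ordering by V1 as well as V2, and the guard that checks for exactly two rows per key, are assumptions added here, as are the names key_counts, ord, first_of_pair and second_of_pair.

```r
# Sample data copied from the question (V3 stored as character).
df <- data.frame(
  V1 = c(999L, 999L, 998L, 998L, 998L, 998L, 998L, 998L, 997L, 997L),
  V2 = c(53L, 53L, 56L, 56L, 58L, 57L, 57L, 58L, 63L, 63L),
  V3 = c("2015-07-02", "2011-07-03", "2015-03-08", "2011-03-18", "2014-12-26",
         "2016-05-21", "2015-04-12", "2013-09-29", "2013-09-28", "2014-08-21"),
  V4 = c(2L, 3L, 4L, 5L, 6L, 8L, 9L, 10L, 19L, 20L),
  stringsAsFactors = FALSE
)

# Guard: the alternating pattern is only valid when every (V1, V2) key
# occurs exactly twice, so stop early otherwise.
key_counts <- table(paste(df$V1, df$V2, sep = "-"))
stopifnot(all(key_counts == 2))

# Order by the full key, then by date. ISO-formatted date strings
# (YYYY-MM-DD) sort chronologically even as character.
ord <- order(df$V1, df$V2, df$V3)

# The length-2 logical vectors are recycled along ord, selecting
# positions 1, 3, 5, ... and 2, 4, 6, ... respectively.
first_of_pair  <- df[ord[c(TRUE, FALSE)], ]
second_of_pair <- df[ord[c(FALSE, TRUE)], ]
```

The guard matters because a single key with one or three rows would shift the alternation for every key that follows it in the ordering, silently mixing earlier and later rows.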

lmo answered Apr 19 '26 12:04


