Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Distinct rows based on date range

Tags:

r

dplyr

I have a large-ish dataset with a structure something like the following:

structure(list(pathogen = c("MRSA", "L. pneumophila", "MRSA", 
"L. pneumophila", "MRSA", "MRSA", "MRSA", "L. pneumophila", "L. pneumophila", 
"MRSA"), variant = c("mecA", "sg1", "mecA", "sg1", "mecA", "mecC", 
"mecA", "sg1", "sg6", "mecA"), n = c(25L, 14L, 235L, 2L, 64L, 
15L, 13L, 6L, 11L, 8L), date = structure(c(15156, 15248, 15279, 
15279, 15309, 15340, 15340, 15400, 15431, 15461), class = "Date")), .Names = c("pathogen", 
"variant", "n", "date"), row.names = c(NA, -10L), class = "data.frame")

I want to find every row with a combination of variables that has not been recorded in the previous x-month period. So when I look for a combination of pathogen and variant that has not been recorded in the previous 3 months I go from:

         pathogen variant   n       date
1            MRSA    mecA  25 2011-07-01
2  L. pneumophila     sg1  14 2011-10-01
3            MRSA    mecA 235 2011-11-01
4  L. pneumophila     sg1   2 2011-11-01
5            MRSA    mecA  64 2011-12-01
6            MRSA    mecC  15 2012-01-01
7            MRSA    mecA  13 2012-01-01
8  L. pneumophila     sg1   6 2012-03-01
9  L. pneumophila     sg6  11 2012-04-01
10           MRSA    mecA   8 2012-05-01

to:

         pathogen variant   n       date
1            MRSA    mecA  25 2011-07-01
2  L. pneumophila     sg1  14 2011-10-01
3            MRSA    mecA 235 2011-11-01
6            MRSA    mecC  15 2012-01-01
8  L. pneumophila     sg1   6 2012-03-01
9  L. pneumophila     sg6  11 2012-04-01
10           MRSA    mecA   8 2012-05-01

All the solutions I've thought of so far involve writing loops. I'm also trying to use dplyr for analysis as much as possible, so my question is: is this possible in dplyr? And if not, what would an R-ish approach look like?

like image 248
stuwest Avatar asked Mar 31 '26 20:03

stuwest


1 Answers

I'm not sure about the best way to handle exact month intervalls, but to get you started, you could compute the difference in number of days as follows (resulting in the same output as shown in the question):

df %>%
  group_by(pathogen, variant) %>%
  filter(c(TRUE, diff(date) > 90))   # check for difference of 90 days
#Source: local data frame [7 x 4]
#Groups: pathogen, variant
#
#        pathogen variant   n       date
#1           MRSA    mecA  25 2011-07-01
#2 L. pneumophila     sg1  14 2011-10-01
#3           MRSA    mecA 235 2011-11-01
#4           MRSA    mecC  15 2012-01-01
#5 L. pneumophila     sg1   6 2012-03-01
#6 L. pneumophila     sg6  11 2012-04-01
#7           MRSA    mecA   8 2012-05-01
like image 106
talat Avatar answered Apr 02 '26 13:04

talat



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!