 

R: keep rows with the maximum value of one column when multiple rows have values close to each other in another column

Tags:

r

I have a data frame with dates and magnitudes. For every case where dates are within 0.6 years of each other, I want to keep the date with the highest absolute magnitude and discard the other(s).

  • This includes cases where multiple dates are all within 0.6 years of each other, like c(2014.2, 2014.4, 2014.5), which should give c(2014.4) if that date had the highest absolute magnitude.
  • For cases where multiple dates could be chained by this criterion (like c(2016.3, 2016.7, 2017.2), where 2016.3 and 2017.2 are not within 0.6 years of each other), I want to treat the two dates that are closest to one another as a pair and consider the extra date as a candidate for the next pair, so the output would read c(2016.3, 2017.2) if 2016.3 had the highest absolute magnitude (see the small gap check below).
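
To see why (2016.3, 2016.7) is treated as the pair in the chained case, the gaps can be checked with diff(); this small check is my own illustration, not part of the original post:

diff(c(2016.3, 2016.7, 2017.2))
# [1] 0.4 0.5
# 2016.3 to 2016.7 is 0.4 (the tightest gap), 2016.7 to 2017.2 is 0.5,
# and 2016.3 to 2017.2 is 0.9 > 0.6, so the three dates cannot form one group.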

data:

set.seed(1)
library(dplyr)
mydf <- data.frame(date = c(2014.25, 2014.41, 2014.53, 2016.3, 
                            2016.7,2017.2,2018.5, 2019.35, 2019.8),
                   magnitude = round(rnorm(9, mean=0, sd=0.4),4))
mydf <- mydf %>% mutate(absmag = abs(magnitude))
mydf
> mydf
     date magnitude absmag
1 2014.25   -0.1222 0.1222
2 2014.41    0.6047 0.6047
3 2014.53    0.1559 0.1559
4 2016.30   -0.2485 0.2485
5 2016.70   -0.8859 0.8859
6 2017.20    0.4500 0.4500
7 2018.50   -0.0180 0.0180
8 2019.35   -0.0065 0.0065
9 2019.80    0.3775 0.3775

Desired output:

     date magnitude absmag
1 2014.41    0.6047 0.6047
2 2016.70   -0.8859 0.8859
3 2017.20    0.4500 0.4500
4 2018.50   -0.0180 0.0180
5 2019.80    0.3775 0.3775

The approaches I tried so far fail to incorporate the requirements in the bullet points.

This attempt can only handle pairs of exactly two items within 0.6 years of each other:

whichAreClose <- function(your.number, x, threshold = 0.6){
  x[which(abs(x - your.number) != 0 & abs(x - your.number) < threshold)]
}
out1 <- sapply(mydf$date, 
                FUN = whichAreClose, 
                x = mydf$date) %>% 
  unlist() %>% 
  split(., cut(seq_along(.), 2, labels = FALSE)) %>% 
  lapply(
    ., function(i){
      mydf %>% 
        filter(date %in% i) %>% 
        slice_min(absmag)}) %>% 
  bind_rows(.) %>% 
  anti_join(mydf, .)

> out1
     date magnitude absmag
1 2014.41    0.6047 0.6047
2 2014.53    0.1559 0.1559
3 2016.30   -0.2485 0.2485
4 2016.70   -0.8859 0.8859
5 2017.20    0.4500 0.4500
6 2018.50   -0.0180 0.0180
7 2019.80    0.3775 0.3775

and this attempt cannot distinguish between different pairs at all:

out2 <- mydf %>% 
  mutate(prevdist = abs(date - lag(date)),
         nextdist = abs(date - lead(date)),
         ispair = case_when(prevdist < 0.6 ~ 'yes',
                            nextdist < 0.6 ~ 'yes',
                            TRUE ~ 'no')) %>% 
  filter(ispair == 'yes') %>% 
  slice_min(absmag) %>% 
  anti_join(mydf, .)

> out2
     date magnitude absmag
1 2014.25   -0.5883 0.5883
2 2014.41   -0.1913 0.1913
3 2014.53    0.1672 0.1672
4 2016.30    0.5435 0.5435
5 2017.20    0.1551 0.1551
6 2018.50   -0.0215 0.0215
7 2019.35   -0.5508 0.5508
8 2019.80   -0.1660 0.1660

P.S.: feel free to edit the title. I struggled to come up with a good one myself.

asked by saQuist


2 Answers

You can try complete-linkage hierarchical clustering of the dates with hclust. The Manhattan (i.e. absolute) distances are calculated between pairs of dates, and the "complete" linkage method ensures that every member of a cluster obtained by cutting the tree at height h is at most h away from every other member of that cluster.

mydf |>
  mutate(k = {
    dist(date, method = "manhattan") |>
      hclust(method = "complete") |>
      cutree(h = .6)
  })

#>      date magnitude absmag k
#> 1 2014.25   -0.2506 0.2506 1
#> 2 2014.41    0.0735 0.0735 1
#> 3 2014.53   -0.3343 0.3343 1
#> 4 2016.30    0.6381 0.6381 2
#> 5 2016.70    0.1318 0.1318 2
#> 6 2017.20   -0.3282 0.3282 3
#> 7 2018.50    0.1950 0.1950 4
#> 8 2019.35    0.2953 0.2953 5
#> 9 2019.80    0.2303 0.2303 5

The dendrogram obtained is shown below:

[dendrogram of the complete-linkage clustering of the dates]

mydf |>
  mutate(k = {
    dist(date, method = "manhattan") |>
      hclust(method = "complete") |>
      cutree(h = .6)
  }) |>
  group_by(k) |>
  filter(absmag == max(absmag)) |>
  as.data.frame()

#>      date magnitude absmag k
#> 1 2014.53   -0.3343 0.3343 1
#> 2 2016.30    0.6381 0.6381 2
#> 3 2017.20   -0.3282 0.3282 3
#> 4 2018.50    0.1950 0.1950 4
#> 5 2019.35    0.2953 0.2953 5
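
As a small variant of the same idea (my addition, not part of the original answer), slice_max() from dplyr can replace the filter(absmag == max(absmag)) step; it keeps the row with the largest absmag in each cluster:

mydf |>
  mutate(k = cutree(hclust(dist(date, method = "manhattan"),
                           method = "complete"),
                    h = 0.6)) |>    # same clustering as above, cut at 0.6
  group_by(k) |>
  slice_max(absmag, n = 1) |>       # highest absolute magnitude per cluster
  ungroup()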

answered by Stefano Barbi


With cumsum and purrr::accumulate.

library(tidyverse)

mydf %>% 
  group_by(cum = cumsum(accumulate(c(0, diff(date)),
                                   ~ ifelse(.x + .y <= 0.6, .x + .y, 0)) == 0)) %>% 
  slice_max(absmag)

     date magnitude absmag    cum
1 2014.53   -0.3343 0.3343      1
2 2016.30    0.6381 0.6381      2
3 2017.20   -0.3282 0.3282      3
4 2018.50    0.1950 0.1950      4
5 2019.35    0.2953 0.2953      5

Explanation:

purrr::accumulate takes the successive differences of the date vector as input; as long as the running sum (.x + .y) stays at or below 0.6 it outputs that running sum (for instance for the first three elements), but once the sum exceeds 0.6 it resets to 0. So for the fourth element, because 0.28 + 1.77 > 0.6, the function sets it to 0.

c(0, diff(mydf$date))
# [1] 0.00 0.16 0.12 1.77 0.40 0.50 1.30 0.85 0.45

accumulate(c(0, diff(mydf$date)), ~ifelse(.x + .y <= 0.6, .x + .y, 0))
# [1] 0.00 0.16 0.28 0.00 0.40 0.00 0.00 0.00 0.45

Every time the running sum resets to 0, a new group starts whose consecutive gaps stay below the threshold. So, to label the groups, one can use cumsum(... == 0):

cumsum(accumulate(c(0, diff(mydf$date)), ~ifelse(.x + .y <= 0.6, .x + .y, 0)) == 0)
# [1] 1 1 1 2 2 3 4 5 5
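
If the helper grouping column is not wanted in the final result, it can be dropped afterwards; a minimal follow-up sketch (my addition), assuming dplyr and purrr are loaded:

mydf %>% 
  group_by(cum = cumsum(accumulate(c(0, diff(date)),
                                   ~ ifelse(.x + .y <= 0.6, .x + .y, 0)) == 0)) %>% 
  slice_max(absmag) %>% 
  ungroup() %>%    # remove the grouping
  select(-cum)     # drop the helper column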

answered by Maël


