
R: How to Prevent Overlapping Ranges Within (Nested) Summary Groups

Tags:

r

I am working with the R programming language.

I have the following dataset on medical characteristics of patients and disease prevalence:

set.seed(123)
library(dplyr)

Patient_ID = 1:5000
gender <- c("Male","Female")
gender <- sample(gender, 5000, replace=TRUE, prob=c(0.45, 0.55))
gender <- as.factor(gender)


status <- c("Immigrant","Citizen")
status <- sample(status, 5000, replace=TRUE, prob=c(0.3, 0.7))
status  <- as.factor(status )

height = rnorm(5000, 150, 10)
weight = rnorm(5000, 90, 10)
hospital_visits = sample.int(20,  5000, replace = TRUE)

################

disease = sample(c(TRUE, FALSE), 5000, replace = TRUE)

###################
my_data = data.frame(Patient_ID, gender, status, height, weight, hospital_visits, disease)

  Patient_ID gender    status   height    weight hospital_visits disease
1          1 Female   Citizen 145.0583 113.70725               1    TRUE
2          2   Male Immigrant 161.2759  88.33188              18   FALSE
3          3 Female Immigrant 138.5305  99.26961               6   FALSE
4          4   Male   Citizen 164.8102  84.31848              12    TRUE
5          5   Male   Citizen 159.1619  92.25090              12    TRUE
6          6 Female   Citizen 153.3513 101.31986              11    TRUE

 

My Problem: Based on this dataset, I am trying to calculate the disease proportions within "nested groups", i.e.

  • First, select all males
  • Then, select all male citizens
  • Then, out of the set of all male citizens - identify a group of 20% of this set with the smallest heights
  • Then, out of the set of all male citizens within the shortest 20% height - further isolate a group of 20% with the smallest weights
  • Finally, out of the set of all male citizens within the shortest 20% of heights and, among those, the lightest 20% of weights - further isolate the 20% with the fewest hospital visits: this will now be the first group
  • Repeat this process for all possible group combinations

This means an individual patient can only belong to a single group - in other words, the groups cannot have any overlapping ranges.

In a previous question (R: Merging a Lookup Table with a Data Frame), I learned how to calculate the disease proportions within nested groups:

final = my_data |>
    group_by(gender, status) |>
    mutate(low_height = height < quantile(height, .2)) |>
    group_by(gender, status, low_height) |>
    mutate(low_weight = weight < quantile(weight, .2)) |>
    group_by(gender,  status, low_height, low_weight) |>
    mutate(low_visit = hospital_visits  < quantile(hospital_visits , .2)) |>
    group_by(gender, status, low_height, low_weight, low_visit) |>
    summarise(across(c(height, weight, hospital_visits),
                     ## list custom stats here:
                     list(min = \(xs) min(xs, na.rm = TRUE),
                          max = \(xs) max(xs, na.rm = TRUE)
                     ),
                     .names = "{.col}_{.fn}"
    ),
    prop_disease = sum(disease)/n(),
    ## etc.
    )

final$low_height = final$low_weight = final$low_visit = NULL

My Question: When I look at the results from this code:

> final
# A tibble: 32 x 9
# Groups:   gender, status [4]
   gender status    height_min height_max weight_min weight_max hospital_visits_min hospital_visits_max prop_disease
   <fct>  <fct>          <dbl>      <dbl>      <dbl>      <dbl>               <int>               <int>        <dbl>
 1 Female Citizen         142.       188.       82.3      119.                    4                  20        0.495
 2 Female Citizen         142.       175.       82.4      118.                    1                   3        0.495
 3 Female Citizen         142.       177.       57.4       82.3                   5                  20        0.482

  • In row 1 and row 2: I can see that overlapping height ranges have been created such as (142,188) and (142, 175)
  • This means that if a medical patient was 150 cm tall - the patient could be assigned to both of these groups: as such, this violates the original condition of non-overlapping groups.

Can someone please show me if there is a way to fix this problem?

Thanks!

stats_noob asked Sep 13 '25 23:09



1 Answer

First of all, I agree with the comment by @I_O: the height and weight ranges may not be fully separable across groups, so some overlap in the reported ranges cannot be avoided.
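To illustrate why the reported ranges can overlap even though no patient is counted twice, here is a minimal sketch on simulated data. The names `toy`, `flagged` and `ranges` are made up for this example, and only two variables are used to keep it short. The weight cut-off is computed inside each height group, so two groups that share a height flag also share roughly the same height range, while their weight ranges stay disjoint.

```r
library(dplyr)

set.seed(123)
n_rows <- 1000

# Hypothetical toy data with the same height/weight distributions as the question
toy <- data.frame(
  id     = seq_len(n_rows),
  height = rnorm(n_rows, 150, 10),
  weight = rnorm(n_rows, 90, 10)
)

# Nested flags: the weight threshold is recomputed within each height group
flagged <- toy |>
  mutate(low_height = height < quantile(height, 0.2)) |>
  group_by(low_height) |>
  mutate(low_weight = weight < quantile(weight, 0.2)) |>
  ungroup()

# One row per (low_height, low_weight) cell, with the observed ranges
ranges <- flagged |>
  group_by(low_height, low_weight) |>
  summarise(
    size       = n(),
    height_min = min(height), height_max = max(height),
    weight_min = min(weight), weight_max = max(weight),
    .groups = "drop"
  )

ranges
```

The group sizes add up to the number of rows, so the cells partition the data. Within a given `low_height` value the two weight ranges do not overlap, but their height ranges do, because both groups are drawn from the same height band. The groups are separated by the combination of conditions, not by each variable's range on its own.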

If you want a summary of the combinations that fall below the 20% quantile (i.e. x < quantile(x, .2)) for each feature of interest, you could try

my_data %>%
    group_by(gender, status) %>%
    group_map(~ {
        .x %>%
            filter(height < quantile(height, .2)) %>%
            filter(weight < quantile(weight, .2)) %>%
            filter(hospital_visits < quantile(hospital_visits, .2)) %>%
            summarise(
                across(c(height, weight, hospital_visits),
                    list(
                        min = \(xs) min(xs, na.rm = TRUE),
                        max = \(xs) max(xs, na.rm = TRUE)
                    ),
                    .names = "{.col}_{.fn}"
                ),
                prop_disease = mean(disease),
                .by = c(gender, status)
            )
    }, .keep = TRUE) %>%
    bind_rows()

which gives

# A tibble: 4 × 9
  gender status  height_min height_max weight_min weight_max hospital_visits_min
  <fct>  <fct>        <dbl>      <dbl>      <dbl>      <dbl>               <int>
1 Female Citizen       123.       141.       70.7       80.6                   1
2 Female Immigr…       124.       140.       62.0       83.5                   2
3 Male   Citizen       125.       140.       66.7       79.8                   1
4 Male   Immigr…       130.       140.       68.4       81.1                   1
# ℹ 2 more variables: hospital_visits_max <int>, prop_disease <dbl>
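The code above keeps only the lowest 20% at each step. If the goal is to cover every nested 20% band so that each patient lands in exactly one group, one possible variation (not part of the code above) is to replace the quantile filters with `dplyr::ntile()`. The sketch below assumes that equal-count bins are acceptable in place of quantile thresholds. It rebuilds a dataset shaped like the one in the question, and the names `binned`, `nested_summary`, `height_bin`, `weight_bin` and `visit_bin` are illustrative.

```r
library(dplyr)

set.seed(123)
n_rows <- 5000

# Dataset shaped like the one in the question
my_data <- data.frame(
  Patient_ID      = seq_len(n_rows),
  gender          = factor(sample(c("Male", "Female"), n_rows, replace = TRUE, prob = c(0.45, 0.55))),
  status          = factor(sample(c("Immigrant", "Citizen"), n_rows, replace = TRUE, prob = c(0.3, 0.7))),
  height          = rnorm(n_rows, 150, 10),
  weight          = rnorm(n_rows, 90, 10),
  hospital_visits = sample.int(20, n_rows, replace = TRUE),
  disease         = sample(c(TRUE, FALSE), n_rows, replace = TRUE)
)

# Assign nested bins: each level is split into 5 bins inside the previous level
binned <- my_data |>
  group_by(gender, status) |>
  mutate(height_bin = ntile(height, 5)) |>
  group_by(gender, status, height_bin) |>
  mutate(weight_bin = ntile(weight, 5)) |>
  group_by(gender, status, height_bin, weight_bin) |>
  mutate(visit_bin = ntile(hospital_visits, 5)) |>
  ungroup()

# One row per nested group, with its size and disease proportion
nested_summary <- binned |>
  summarise(
    n_patients   = n(),
    prop_disease = mean(disease),
    .by = c(gender, status, height_bin, weight_bin, visit_bin)
  )

head(nested_summary)
```

Every patient receives one `(gender, status, height_bin, weight_bin, visit_bin)` label, so the groups cannot share members. Height bins are disjoint within each gender/status pair, whereas weight bins are disjoint only within a given height bin. Note that `ntile()` splits ties by row order, so patients with the same number of hospital visits can end up in adjacent `visit_bin` values.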
ThomasIsCoding answered Sep 15 '25 14:09
