I am working with the R programming language.
I have the following dataset on medical characteristics of patients and disease prevalence:
set.seed(123)
library(dplyr)
Patient_ID = 1:5000
gender <- c("Male","Female")
gender <- sample(gender, 5000, replace=TRUE, prob=c(0.45, 0.55))
gender <- as.factor(gender)
status <- c("Immigrant","Citizen")
status <- sample(status, 5000, replace=TRUE, prob=c(0.3, 0.7))
status <- as.factor(status )
height = rnorm(5000, 150, 10)
weight = rnorm(5000, 90, 10)
hospital_visits = sample.int(20, 5000, replace = TRUE)
################
disease = sample(c(TRUE, FALSE), 5000, replace = TRUE)
###################
my_data = data.frame(Patient_ID, gender, status, height, weight, hospital_visits, disease)
  Patient_ID gender    status   height    weight hospital_visits disease
1          1 Female   Citizen 145.0583 113.70725               1    TRUE
2          2   Male Immigrant 161.2759  88.33188              18   FALSE
3          3 Female Immigrant 138.5305  99.26961               6   FALSE
4          4   Male   Citizen 164.8102  84.31848              12    TRUE
5          5   Male   Citizen 159.1619  92.25090              12    TRUE
6          6 Female   Citizen 153.3513 101.31986              11    TRUE
My Problem: Based on this dataset, I am trying to calculate the disease proportions within "nested groups".
This means an individual patient can only belong to a single group. In other words, the groups cannot have any overlapping ranges.
In a previous question (R: Merging a Lookup Table with a Data Frame), I learned how to calculate the disease proportions within nested groups:
final = my_data |>
  group_by(gender, status) |>
  mutate(low_height = height < quantile(height, .2)) |>
  group_by(gender, status, low_height) |>
  mutate(low_weight = weight < quantile(weight, .2)) |>
  group_by(gender, status, low_height, low_weight) |>
  mutate(low_visit = hospital_visits < quantile(hospital_visits, .2)) |>
  group_by(gender, status, low_height, low_weight, low_visit) |>
  summarise(across(c(height, weight, hospital_visits),
                   ## list custom stats here:
                   list(min = \(xs) min(xs, na.rm = TRUE),
                        max = \(xs) max(xs, na.rm = TRUE)
                   ),
                   .names = "{.col}_{.fn}"
            ),
            prop_disease = sum(disease)/n(),
            ## etc.
  )
final$low_height = final$low_weight = final$low_visit = NULL
My Question: When I look at the results from this code:
> final
# A tibble: 32 x 9
# Groups: gender, status [4]
gender status height_min height_max weight_min weight_max hospital_visits_min hospital_visits_max prop_disease
<fct> <fct> <dbl> <dbl> <dbl> <dbl> <int> <int> <dbl>
1 Female Citizen 142. 188. 82.3 119. 4 20 0.495
2 Female Citizen 142. 175. 82.4 118. 1 3 0.495
3 Female Citizen 142. 177. 57.4 82.3 5 20 0.482
Can someone please show me whether there is a way to fix this problem, so that the groups do not have overlapping ranges?
Thanks!
First of all, as @I_O said in the comments, I agree that height and weight may not be cleanly separable, so the overlap cannot always be avoided.
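To illustrate why the ranges can overlap even though every patient lands in exactly one group, here is a minimal sketch on hypothetical toy data (the `toy` data frame and the `ranges` name are my own, not part of the question). It applies the same "split on height, then split on weight within each height group" idea and prints the per-group ranges.

```r
library(dplyr)

set.seed(123)

# Hypothetical toy data, only meant to mimic the height/weight columns
toy <- data.frame(
  height = rnorm(1000, 150, 10),
  weight = rnorm(1000, 90, 10)
)

ranges <- toy |>
  # first split: bottom 20% of height vs. the rest
  mutate(low_height = height < quantile(height, 0.2)) |>
  # second split: bottom 20% of weight, computed within each height group
  group_by(low_height) |>
  mutate(low_weight = weight < quantile(weight, 0.2)) |>
  group_by(low_height, low_weight) |>
  summarise(
    n = n(),
    height_min = min(height), height_max = max(height),
    weight_min = min(weight), weight_max = max(weight),
    .groups = "drop"
  )

ranges
```

The four rows partition the patients (the `n` column adds up to the number of rows in `toy`), but rows that share the same `low_height` value cover the same stretch of heights, and the two `low_weight == FALSE` rows cover largely the same stretch of weights. So the groups are disjoint as sets of patients, while the min/max columns, read one variable at a time, still overlap.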
If you want a summary, for every gender/status combination, of the patients that fall below the 20% quantile (x < quantile(x, .2))
on each of the features of interest, you can probably try
library(dplyr)

my_data %>%
  group_by(gender, status) %>%
  group_map(~ {
    # within one gender/status group, keep the bottom 20% step by step
    .x %>%
      filter(height < quantile(height, .2)) %>%
      filter(weight < quantile(weight, .2)) %>%
      filter(hospital_visits < quantile(hospital_visits, .2)) %>%
      summarise(
        across(c(height, weight, hospital_visits),
          list(
            min = \(xs) min(xs, na.rm = TRUE),
            max = \(xs) max(xs, na.rm = TRUE)
          ),
          .names = "{.col}_{.fn}"
        ),
        prop_disease = mean(disease),
        .by = c(gender, status)
      )
  }, .keep = TRUE) %>%
  bind_rows()
which gives
# A tibble: 4 × 9
gender status height_min height_max weight_min weight_max hospital_visits_min
<fct> <fct> <dbl> <dbl> <dbl> <dbl> <int>
1 Female Citizen 123. 141. 70.7 80.6 1
2 Female Immigr… 124. 140. 62.0 83.5 2
3 Male Citizen 125. 140. 66.7 79.8 1
4 Male Immigr… 130. 140. 68.4 81.1 1
# ℹ 2 more variables: hospital_visits_max <int>, prop_disease <dbl>
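As a possible variation, the same step-by-step filtering can be sketched without group_map by passing `.by` to each `filter()` call. This assumes dplyr 1.1.0 or later (which the `.by` argument requires) and R 4.1 or later for the native pipe. The data below is simulated along the lines of the question's my_data, and the names `low_all` and `n_patients` are mine.

```r
library(dplyr)  # filter(.by = ) and summarise(.by = ) need dplyr >= 1.1.0

# Simulated data along the lines of the question's my_data
set.seed(123)
n <- 5000
my_data <- data.frame(
  Patient_ID = 1:n,
  gender = factor(sample(c("Male", "Female"), n, replace = TRUE, prob = c(0.45, 0.55))),
  status = factor(sample(c("Immigrant", "Citizen"), n, replace = TRUE, prob = c(0.3, 0.7))),
  height = rnorm(n, 150, 10),
  weight = rnorm(n, 90, 10),
  hospital_visits = sample.int(20, n, replace = TRUE),
  disease = sample(c(TRUE, FALSE), n, replace = TRUE)
)

low_all <- my_data |>
  # each filter recomputes its quantile per gender/status group,
  # using only the rows that survived the previous filter
  filter(height < quantile(height, .2), .by = c(gender, status)) |>
  filter(weight < quantile(weight, .2), .by = c(gender, status)) |>
  filter(hospital_visits < quantile(hospital_visits, .2), .by = c(gender, status)) |>
  summarise(
    across(c(height, weight, hospital_visits),
           list(min = min, max = max),
           .names = "{.col}_{.fn}"),
    prop_disease = mean(disease),
    n_patients = n(),  # how many patients are left in each group
    .by = c(gender, status)
  ) |>
  arrange(gender, status)

# show every column instead of the truncated tibble printout
print(as.data.frame(low_all))
```

The extra `n_patients` column is worth a look: after three successive 20% cuts only a small number of patients remain per group, so `prop_disease` for these groups rests on few observations.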