The sample data frame:
no <- rep(1:5, each=2)
type <- rep(LETTERS[1:2], times=5)
set.seed(4)
value <- round(runif(10, 10, 30))
df <- data.frame(no, type, value)
df
no type value
1 1 A 22
2 1 B 10
3 2 A 16
4 2 B 16
5 3 A 26
6 3 B 15
7 4 A 24
8 4 B 28
9 5 A 29
10 5 B 11
Now what I want is to calculate the % value of each type of type (A or B) and create separate columns. Desired output is something like this:
no pct_A pct_B total_value
1 1 68.75000 31.25000 32
2 2 50.00000 50.00000 32
3 3 63.41463 36.58537 41
4 4 46.15385 53.84615 52
5 5 72.50000 27.50000 40
What I have tried so far (This gives the right output but the process seems very sub-optimal):
df %>%
group_by(no) %>%
mutate(total_value= sum(value))-> df
df %>%
mutate(pct_A=ifelse(type=='A', (value/total_value) *100, 0),
pct_B=ifelse(type=='B', (value/total_value) *100, 0)) %>%
group_by(no) %>%
summarise(pct_A=sum(pct_A),
pct_B=sum(pct_B)) %>%
ungroup() %>%
merge(df) %>%
distinct(no, .keep_all = T) %>%
select(-type, -value)
Is there any better way to do that? Especially using dplyr
?
I looked for other answers too, but no help. This one came closer:
R Create new column of values based on the factor levels of another column
You could do it in base using aggregate
.
do.call(data.frame, aggregate(value ~ no, df, \(x) c(proportions(x), sum(x)))) |>
setNames(c('no', 'pct_A', 'pct_B', 'total_value'))
# no pct_A pct_B total_value
# 1 1 0.6875000 0.3125000 32
# 2 2 0.5000000 0.5000000 32
# 3 3 0.6341463 0.3658537 41
# 4 4 0.4615385 0.5384615 52
# 5 5 0.7250000 0.2750000 40
For each no
we can calculate sum
and ratio then get the data in wide format.
library(dplyr)
library(tidyr)
df %>%
group_by(no) %>%
mutate(total_value = sum(value),
value = prop.table(value) * 100) %>%
ungroup %>%
pivot_wider(names_from = type, values_from = value, names_prefix = 'pct_')
# no total_value pct_A pct_B
# <int> <dbl> <dbl> <dbl>
#1 1 32 68.8 31.2
#2 2 32 50 50
#3 3 41 63.4 36.6
#4 4 52 46.2 53.8
#5 5 40 72.5 27.5
Here are two more ways to do this.
We could use purrr::map_dfc
. However, setting up the correct column names is kind of cumbersome:
library(dplyr)
library(purrr)
df %>%
group_by(no) %>%
summarise(total_value = sum(value),
map_dfc(unique(type) %>% set_names(., paste0("pct_",.)),
~ sum((type == .x) * value) / total_value * 100)
)
#> # A tibble: 5 x 4
#> no total_value pct_A pct_B
#> <int> <dbl> <dbl> <dbl>
#> 1 1 32 68.8 31.2
#> 2 2 32 50 50
#> 3 3 41 63.4 36.6
#> 4 4 52 46.2 53.8
#> 5 5 40 72.5 27.5
Alternatively we can use dplyover::over
(disclaimer: I'm the maintainer) which allows us to create names on the fly in a across
-like way:
library(dplyover) # https://github.com/TimTeaFan/dplyover
df %>%
group_by(no) %>%
summarise(total_value = sum(value),
over(dist_values(type), # alternatively `unique(type)`
~ sum((type == .x) * value) / total_value * 100,
.names = "pct_{x}")
)
#> # A tibble: 5 x 4
#> no total_value pct_A pct_B
#> <int> <dbl> <dbl> <dbl>
#> 1 1 32 68.8 31.2
#> 2 2 32 50 50
#> 3 3 41 63.4 36.6
#> 4 4 52 46.2 53.8
#> 5 5 40 72.5 27.5
Created on 2021-09-17 by the reprex package (v2.0.1)
Performance-wise both approaches should be faster compared to data-rectangling approaches such as pivot_wider
(but I haven't tested this specific scenario).
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With