How do I create new columns based on the values of a different column and compute the percentage of another numerical column in R?

The sample data frame:

no <- rep(1:5, each=2)
type <- rep(LETTERS[1:2], times=5)
set.seed(4)
value <- round(runif(10, 10, 30))

df <- data.frame(no, type, value)
df

    no type value
1   1    A    22
2   1    B    10
3   2    A    16
4   2    B    16
5   3    A    26
6   3    B    15
7   4    A    24
8   4    B    28
9   5    A    29
10  5    B    11

Now what I want is to calculate, for each no, the % of value contributed by each type (A or B) and create separate columns for them. The desired output is something like this:

 no    pct_A    pct_B total_value
1  1 68.75000 31.25000          32
2  2 50.00000 50.00000          32
3  3 63.41463 36.58537          41
4  4 46.15385 53.84615          52
5  5 72.50000 27.50000          40
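
For example, for no == 1 the group total is 22 + 10 = 32, so pct_A = 22 / 32 * 100 = 68.75 and pct_B = 10 / 32 * 100 = 31.25:

# quick sanity check for the first group (no == 1)
c(pct_A = 22 / 32 * 100, pct_B = 10 / 32 * 100)
# pct_A pct_B 
# 68.75 31.25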

What I have tried so far (this gives the right output, but the process seems very sub-optimal):

df %>%
  group_by(no) %>%
  mutate(total_value = sum(value)) -> df

df %>%
  mutate(pct_A = ifelse(type == 'A', (value / total_value) * 100, 0),
         pct_B = ifelse(type == 'B', (value / total_value) * 100, 0)) %>%
  group_by(no) %>%
  summarise(pct_A = sum(pct_A),
            pct_B = sum(pct_B)) %>%
  ungroup() %>%
  merge(df) %>%
  distinct(no, .keep_all = TRUE) %>%
  select(-type, -value)

Is there any better way to do that? Especially using dplyr?

I looked at other answers too, but they didn't help. This one came closest:

R Create new column of values based on the factor levels of another column

Asked Sep 17 '21 by Shibaprasadb


3 Answers

You could do it in base R using aggregate().

do.call(data.frame, aggregate(value ~ no, df, \(x) c(proportions(x), sum(x)))) |>
  setNames(c('no', 'pct_A', 'pct_B', 'total_value'))
#   no     pct_A     pct_B total_value
# 1  1 0.6875000 0.3125000          32
# 2  2 0.5000000 0.5000000          32
# 3  3 0.6341463 0.3658537          41
# 4  4 0.4615385 0.5384615          52
# 5  5 0.7250000 0.2750000          40
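
Note that proportions() (an alias for prop.table(), available since R 4.0.1) returns fractions on the 0-1 scale, and the \(x) lambda plus the |> pipe need R >= 4.1. If you want the 0-100 percentages from the desired output, a small tweak (not part of the original answer) is to multiply by 100:

do.call(data.frame, aggregate(value ~ no, df, \(x) c(proportions(x) * 100, sum(x)))) |>
  setNames(c('no', 'pct_A', 'pct_B', 'total_value'))
# pct_A and pct_B are now on the 0-100 scale, matching the desired output
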
Answered Oct 20 '22 by jay.sf


For each no we can calculate the sum and the ratio, then get the data in wide format.

library(dplyr)
library(tidyr)

df %>%
  group_by(no) %>%
  mutate(total_value = sum(value),
         value = prop.table(value) * 100) %>%
  ungroup %>%
  pivot_wider(names_from = type, values_from = value, names_prefix = 'pct_')

#     no total_value pct_A pct_B
#  <int>       <dbl> <dbl> <dbl>
#1     1          32  68.8  31.2
#2     2          32  50    50  
#3     3          41  63.4  36.6
#4     4          52  46.2  53.8
#5     5          40  72.5  27.5
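
If you also want the columns in the same order as the desired output (pct_A, pct_B, then total_value), one small tweak (not part of the original answer) is to append a relocate() step:

df %>%
  group_by(no) %>%
  mutate(total_value = sum(value),
         value = prop.table(value) * 100) %>%
  ungroup() %>%
  pivot_wider(names_from = type, values_from = value, names_prefix = 'pct_') %>%
  relocate(total_value, .after = last_col())
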
Answered Oct 20 '22 by Ronak Shah


Here are two more ways to do this.

We could use purrr::map_dfc. However, setting up the correct column names is kind of cumbersome:

library(dplyr)
library(purrr)

df %>% 
  group_by(no) %>% 
  summarise(total_value = sum(value),
            map_dfc(unique(type) %>% set_names(., paste0("pct_",.)), 
                    ~ sum((type == .x) * value) / total_value * 100)
  )

#> # A tibble: 5 x 4
#>      no total_value pct_A pct_B
#>   <int>       <dbl> <dbl> <dbl>
#> 1     1          32  68.8  31.2
#> 2     2          32  50    50  
#> 3     3          41  63.4  36.6
#> 4     4          52  46.2  53.8
#> 5     5          40  72.5  27.5

Alternatively we can use dplyover::over (disclaimer: I'm the maintainer), which allows us to create names on the fly in an across()-like way:

library(dplyover) # https://github.com/TimTeaFan/dplyover

df %>% 
  group_by(no) %>% 
  summarise(total_value = sum(value),
            over(dist_values(type), # alternatively `unique(type)`
                 ~ sum((type == .x) * value) / total_value * 100,
                 .names = "pct_{x}")
            )

#> # A tibble: 5 x 4
#>      no total_value pct_A pct_B
#>   <int>       <dbl> <dbl> <dbl>
#> 1     1          32  68.8  31.2
#> 2     2          32  50    50  
#> 3     3          41  63.4  36.6
#> 4     4          52  46.2  53.8
#> 5     5          40  72.5  27.5

Created on 2021-09-17 by the reprex package (v2.0.1)

Performance-wise, both approaches should be faster than data-rectangling approaches such as pivot_wider (but I haven't tested this specific scenario).
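
A quick way to check that claim on your own data would be a micro-benchmark, for example with the bench package (just a sketch, not a result I have run; it assumes df, dplyr, tidyr and purrr from above):

library(bench)

bench::mark(
  pivot_wider = df %>%
    group_by(no) %>%
    mutate(total_value = sum(value),
           value = prop.table(value) * 100) %>%
    ungroup() %>%
    pivot_wider(names_from = type, values_from = value, names_prefix = 'pct_'),
  map_dfc = df %>%
    group_by(no) %>%
    summarise(total_value = sum(value),
              map_dfc(unique(type) %>% set_names(., paste0("pct_", .)),
                      ~ sum((type == .x) * value) / total_value * 100)),
  check = FALSE  # the outputs can differ in column order/attributes
)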

Answered Oct 20 '22 by TimTeaFan