Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to restructure dataframe having multiple variable by keeping some variable static & dynamic in R

Tags:

dataframe

r

dplyr

I am using the following dataframe in R.

uid     Date                  batch_no       marking       seq
K-1     16/03/2020  12:11:33  7              S1            FRD
K-1     16/03/2020  12:11:33  7              S1            FHL
K-2     16/03/2020  12:11:33  8              SE_hold1      ABC
K-3     16/03/2020  12:11:33  9              SD_hold2      DEF
K-4     16/03/2020  12:11:33  8              S1            XYZ
K-5     16/03/2020  12:11:33                 NA            ABC
K-6     16/03/2020  12:11:33  7                            ZZZ
K-7     16/03/2020  12:11:33  NA             S2            NA
K-8     16/03/2020  12:11:33  6              S3            FRD
  • The seq column will have eight unique value including NA; it's not necessary that all 8 values are available for every day's date.
  • batch_no will have six unique values including NA and blank; it's not necessary that all six values are available for every day's date.
  • The marking column will have ~ 25 unique value, but need to consider values with suffix _hold# as Hold; after that, there would be six unique value including blank and NA.

The requirement is to merge the dcast dataframe in the following order to have a single view summary for an analysis.

I want to keep all the unique values static in the code, so that if the particular value is not available for a particular date I'll get 0 or - in summary table.

Desired Output:

seq      count  percentage   Marking     count     Percentage     batch_no   count    Percentage
FRD      1      12.50%       S1          2         25.00%         6          1        12.50%
FHL      1      12.50%       S2          1         12.50%         7          2        25.00%
ABC      2      25.00%       S3          1         12.50%         8          2        25.00%
DEF      1      12.50%       Hold        2         25.00%         9          1        12.50%
XYZ      1      12.50%       NA          1         12.50%         NA         1        12.50%
ZZZ      1      12.50%       (Blank)     1         12.50%         (Blank)    1        12.50%
FRD      1      12.50%         -         -           -             -         -           -
NA       1      12.50%         -         -           -             -         -           -
(Blank)  0      0.00%          -         -           -             -         -           -
Total    8      112.50%        -         8         100.00%         -         8         100.00%

For seq we have % > 100 because of double counting of same uid for value FRD and FHL. That is the accepted scenario. In Total will have only distinct count of uid.

like image 222
Sophia Wilson Avatar asked Apr 09 '20 02:04

Sophia Wilson


1 Answers

There are a few ways of approaching this problem, one route would be starting with cleaning your data, joining that onto a table that has all the combinations you explicitly want and then summarising. NB: this will give a lot of explicit 0's due to combining the combinations from those three columns.

df = df_original %>% 
  mutate(marking = if_else(str_detect(marking,"hold"),"Hold", marking)) %>% 
  mutate_at(vars(c("seq", "batch_no", "marking")), forcats::fct_explicit_na, na_level = "(Blank)") 

## You need to do something similar with vectors of the possible values
## i.e. I don't know all the levels of your factors
#--------------------------------------------------------------------------
# Appending the NA and (Blank) levels ensures they are included in case the
# batch of data doesn't have them

df_seq = data.frame(seq = c(df$seq %>% levels(),"NA","(Blank)") %>% unique())
df_batch_no = data.frame(batch_no = c(df$batch_no %>% levels(),"NA","(Blank)") %>% unique())
df_marking = data.frame(marking = c(df$marking %>% levels(),"NA","(Blank)") %>% unique())

# would have been really nice to use janitor::tabyl but your output won't allow

df_seq_summary = df %>%
  group_by(seq) %>% 
  summarise(count = n()) %>% 
  right_join(df_seq, by = "seq") %>% 
  mutate(count = replace_na(count, 0),
  percentage = count / n()) %>% 
  mutate(row = row_number())

df_marking_summary =  df %>%
  group_by(marking) %>% 
  summarise(count = n()) %>% 
  right_join(df_marking, by = "marking") %>% 
  mutate(count = replace_na(count, 0),
         percentage = count / sum(count)) %>% 
  mutate(row = row_number())

df_batch_no_summary =  df %>%
  group_by(batch_no) %>% 
  summarise(count = n()) %>% 
  right_join(df_batch_no, by = "batch_no") %>% 
  mutate(count = replace_na(count, 0),
         percentage = count / sum(count)) %>% 
  mutate(row = row_number())

df = df_seq_summary %>% 
  full_join(df_marking_summary, by =  "row", suffix = c("", "_marking")) %>% 
  full_join(df_batch_no_summary, by =  "row", suffix = c("", "_batch_no")) %>% 
  select(-row) %>% 
bind_rows(summarise_all(., ~(if(is.numeric(.)) sum(if_else(.>0,as.double(.),0), na.rm = T) else "Total"))) %>% 
  mutate_at(vars(contains("percentage")), scales::percent, accuracy = 0.01)
like image 183
MMerry Avatar answered Oct 13 '22 00:10

MMerry