Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to create matrix of distribution in R

I have the following dataframe in R.

ID         Date                  List             Type
P-10012    2020-04-15 12:13:15   ABC,ABD,BCD      TR1
P-10012    2020-04-15 12:13:15   ABC,ABD,BCD      RES
P-10012    2020-04-15 12:13:15   ABC,ABD,BCD      FTT
P-10013    2020-04-12 17:10:05                    TR1
P-10013    2020-04-12 17:10:05                    FTT
P-10013    2020-04-12 17:10:05                    ZXR
P-10014    2020-04-10 04:30:19   ABD,BCD          TR1
P-10014    2020-04-10 04:30:19   ABD,BCD          ZXR
P-10015    2020-04-10 14:13:15   ABC              
P-10016    2020-04-10 13:13:15   
P-10017    2020-03-18 10:13:15   ABC,ABD,BCD      TR1



dput(df)
df <- structure(list(ID = c("P-10012", "P-10012", 
"P-10012", "P-10013", "P-10013", "P-10013", 
"P-10014", "P-10014", "P-10015", "P-10016", 
"P-10017"), Date = c("2020-04-15 12:13:15", "2020-04-15 12:13:15", 
"2020-04-15 12:13:15", "2020-04-12 17:10:05", "2020-04-12 17:10:05", 
"2020-04-12 17:10:05", "2020-04-10 04:30:19", "2020-04-10 04:30:19", 
"2020-04-10 14:13:15", "2020-04-10 13:13:15", "2020-03-18 10:13:15"
), Type = c("TR1", "RES", "FTT", "TR1", "FTT", "ZXR", "TR1", "ZXR", NA, NA, "TR1"), List = c("ABC,ABD,BCD", "ABC,ABD,BCD", "ABC,ABD,BCD", 
"", "", "", "ABD,BCD", "ABD,BCD", "ABC", "", "ABC,ABD,BCD")), class = "data.frame", row.names = c(NA, 
-11L))

The structure of the dataframe is that it will always have the same List value for a particular ID in case if there are multiple row available for that particular ID because it has multiple different value in Type. If for a particular ID there is only 1 Type value then it will always have one row.

I need to create the following distribution for the month of Apr-20 of List values as mentioned in the comma separated manner and Type values.

Where, the first 7 rows in my Required Df are distinct count of ID based on the condition (i.e whether List or Type are blank or not) and distribution across all unique List and Type value. For these 7 rows the Distinct_Count should be divided by Total to get Percentage. However, from 8th row and onward if the unique value is form List then it should be divided by total distinct count of Non_Blank_List and if the value is from Type then it should be divided by total distinct count of Non_Blank_Type.

In the following matrix, I just want to understand what is the distribution of unique values of List and Type distinctly and in combination with other values.

Please not that for the example purpose I have simplified the List and Type values in 3 and 4 unique values respectively but in my actual dataframe it is quite high and it varies from month to month so please don't hard code the values.

I have tried multiple approach but couldn't achieve the required output yet.

Required Df<-

APR-21           Distinct_Count    Percentage    ABC     ABD      BCD     TR1    RES    FTT     ZXR
Total_ID         5                 100.00%       2       2        2       3      1      2       2 
Blank_List       2                 40.00%        0       0        0       1      0      1       1
Blank_Type       2                 40.00%        1       0        0       0      0      0       0
Both_Blank       1                 20.00%        0       0        0       0      0      0       0
Non_Blank_List   3                 60.00%        2       2        2       2      1      1       1         
Non_Blank_Type   3                 60.00%        1       2        2       3      1      1       2
Both_Non_Blank   2                 40.00%        1       2        2       2      1      1       1
ABC              1                 33.33%        2       1        1       1      1      1       0
ABD              0                  0.00%        1       2        2       2      1      1       1
BCD              0                  0.00%        1       2        2       2      1      1       1
TR1              0                  0.00%        1       2        2       3      1      1       1 
RES              0                  0.00%        1       1        1       1      1      1       0    
FTT              0                  0.00%        1       1        1       2      1      2       1 
ZXR              0                  0.00%        0       1        1       1      0      1       2
like image 751
Sophia Wilson Avatar asked Oct 14 '22 21:10

Sophia Wilson


1 Answers

The biggest challenge is that keys come in rows and columns.

I used 2 custom functions to count the occurrences:

  • getcount(), which uses a condition argument (see here if you are not familiar with quosures), for your special conditional rows (TotalID, Blank_list, ...)
  • getcount2(), which uses a simple character argument, for your cases rows (ABC, BCD, TR1...)

Both work mostly the same way. We calculate the count separately for all single cases and for the total, always grouping by ID and concatenate the results.

Here is the code:

library(tidyverse)
library(lubridate)

df <- structure(list(ID = c("P-10012", "P-10012", "P-10012", "P-10013", "P-10013", "P-10013", 
                            "P-10014", "P-10014", "P-10015", "P-10016", "P-10017"), 
                     Date = c("2020-04-15 12:13:15", "2020-04-15 12:13:15", "2020-04-15 12:13:15", 
                              "2020-04-12 17:10:05", "2020-04-12 17:10:05", "2020-04-12 17:10:05", 
                              "2020-04-10 04:30:19", "2020-04-10 04:30:19", "2020-04-10 14:13:15", 
                              "2020-04-10 13:13:15", "2020-03-18 10:13:15"), 
                     Type = c("TR1", "RES", "FTT", "TR1", "FTT", "ZXR", "TR1", "ZXR", NA, NA, "TR1"),
                     List = c("ABC,ABD,BCD", "ABC,ABD,BCD", "ABC,ABD,BCD", "", "", "", "ABD,BCD", 
                              "ABD,BCD", "ABC", "", "ABC,ABD,BCD")), 
                class = "data.frame", row.names = c(NA, -11L))

#extract all the individual values from Type and List
cases = c(df$Type, str_split(df$List, ", ?", simplify=TRUE)) %>% unique() %>% 
  sort() %>% .[.!=""] %>% rlang::set_names()

#util function
is_blank = function(x) is.na(x) | x==""

#get count for summary rows (TotalID, Blank_list, ...)
getcount = function(cond){
  x = map_dbl(cases, ~df %>% 
                filter(month(Date)==4) %>%
                group_by(ID) %>% 
                summarise(rtn=any({{cond}} & (str_detect(Type, .x) | str_detect(List, .x)))) %>% 
                pull() %>% sum(na.rm=TRUE)
  )
  x_tot = df %>% 
    filter(month(Date)==4) %>% 
    group_by(ID) %>% 
    summarise(rtn=any({{cond}})) %>% 
    pull() %>% sum(na.rm=TRUE)
  
  c(x_tot, x)
}

#get count for cases rows (ABC, BCD, TR1...)
getcount2 = function(key){
  x = map_dbl(cases, ~df %>% 
                filter(month(Date)==4) %>%
                group_by(ID) %>% 
                summarise(rtn=any(
                  (key %in% Type  | str_detect(List, key)) &
                    (str_detect(Type, .x ) | str_detect(List, .x ))
                )) %>% 
                pull() %>% sum(na.rm=TRUE)
  )
  x_tot = df %>% 
    filter(month(Date)==4) %>% 
    group_by(ID) %>% 
    summarise(rtn=any(List==key)) %>% 
    pull() %>% sum(na.rm=TRUE)
  
  c(tot=x_tot, x)
}


#here we go!
tibble(x=c("Distinct_Count", cases)) %>% 
  mutate(
    Total_ID=getcount(TRUE),
    Blank_List=getcount(is_blank(List)),
    Blank_Type=getcount(is_blank(Type)),
    Blank_Both=getcount(is_blank(List) & is_blank(Type)),
    Non_Blank_List=getcount(!is_blank(List)),
    Non_Blank_Type=getcount(!is_blank(Type)),
    Non_Blank_Both=getcount(!is_blank(List) & !is_blank(Type))
  ) %>% 
  bind_cols(map_dfc(cases, ~getcount2(.x))) %>% 
  column_to_rownames("x") %>% 
  t() %>% as.data.frame() %>% 
  mutate(Percentage = scales::percent(Distinct_Count/max(Distinct_Count)), .after="Distinct_Count")
#>                Distinct_Count Percentage ABC ABD BCD FTT RES TR1 ZXR
#> Total_ID                    5       100%   2   2   2   2   1   3   2
#> Blank_List                  2        40%   0   0   0   1   0   1   1
#> Blank_Type                  2        40%   1   0   0   0   0   0   0
#> Blank_Both                  1        20%   0   0   0   0   0   0   0
#> Non_Blank_List              3        60%   2   2   2   1   1   2   1
#> Non_Blank_Type              3        60%   1   2   2   2   1   3   2
#> Non_Blank_Both              2        40%   1   2   2   1   1   2   1
#> ABC                         1        20%   2   1   1   1   1   1   0
#> ABD                         0         0%   1   2   2   1   1   2   1
#> BCD                         0         0%   1   2   2   1   1   2   1
#> FTT                         0         0%   1   1   1   2   1   2   1
#> RES                         0         0%   1   1   1   1   1   1   0
#> TR1                         0         0%   1   2   2   2   1   3   2
#> ZXR                         0         0%   0   1   1   1   0   2   2

Created on 2021-05-12 by the reprex package (v2.0.0)

Note that there are several minor differences with your expected output, but I assume they are little mistakes from your complex example. For instance, ABC has Distinct_Count==1 so out of 5 it should not make 33%. Also, ZXR can be seen with TR1 twice (ID 13 and 14).

like image 51
Dan Chaltiel Avatar answered Oct 20 '22 04:10

Dan Chaltiel