I have the following dataframe in R.
ID Date List Type
P-10012 2020-04-15 12:13:15 ABC,ABD,BCD TR1
P-10012 2020-04-15 12:13:15 ABC,ABD,BCD RES
P-10012 2020-04-15 12:13:15 ABC,ABD,BCD FTT
P-10013 2020-04-12 17:10:05 TR1
P-10013 2020-04-12 17:10:05 FTT
P-10013 2020-04-12 17:10:05 ZXR
P-10014 2020-04-10 04:30:19 ABD,BCD TR1
P-10014 2020-04-10 04:30:19 ABD,BCD ZXR
P-10015 2020-04-10 14:13:15 ABC
P-10016 2020-04-10 13:13:15
P-10017 2020-03-18 10:13:15 ABC,ABD,BCD TR1
dput(df)
df <- structure(list(ID = c("P-10012", "P-10012",
"P-10012", "P-10013", "P-10013", "P-10013",
"P-10014", "P-10014", "P-10015", "P-10016",
"P-10017"), Date = c("2020-04-15 12:13:15", "2020-04-15 12:13:15",
"2020-04-15 12:13:15", "2020-04-12 17:10:05", "2020-04-12 17:10:05",
"2020-04-12 17:10:05", "2020-04-10 04:30:19", "2020-04-10 04:30:19",
"2020-04-10 14:13:15", "2020-04-10 13:13:15", "2020-03-18 10:13:15"
), Type = c("TR1", "RES", "FTT", "TR1", "FTT", "ZXR", "TR1", "ZXR", NA, NA, "TR1"), List = c("ABC,ABD,BCD", "ABC,ABD,BCD", "ABC,ABD,BCD",
"", "", "", "ABD,BCD", "ABD,BCD", "ABC", "", "ABC,ABD,BCD")), class = "data.frame", row.names = c(NA,
-11L))
The structure of the dataframe is that it will always have the same List
value for a particular ID
in case if there are multiple row available for that particular ID
because it has multiple different value in Type
. If for a particular ID
there is only 1 Type
value then it will always have one row.
I need to create the following distribution for the month of Apr-20
of List
values as mentioned in the comma separated manner and Type
values.
Where, the first 7 rows in my Required Df
are distinct count of ID
based on the condition (i.e whether List
or Type
are blank or not) and distribution across all unique List
and Type
value. For these 7 rows the Distinct_Count
should be divided by Total
to get Percentage
. However, from 8th row and onward if the unique value is form List
then it should be divided by total distinct count of Non_Blank_List
and if the value is from Type
then it should be divided by total distinct count of Non_Blank_Type
.
In the following matrix, I just want to understand what is the distribution of unique values of List
and Type
distinctly and in combination with other values.
Please not that for the example purpose I have simplified the List
and Type
values in 3 and 4 unique values respectively but in my actual dataframe it is quite high and it varies from month to month so please don't hard code the values.
I have tried multiple approach but couldn't achieve the required output yet.
Required Df<-
APR-21 Distinct_Count Percentage ABC ABD BCD TR1 RES FTT ZXR
Total_ID 5 100.00% 2 2 2 3 1 2 2
Blank_List 2 40.00% 0 0 0 1 0 1 1
Blank_Type 2 40.00% 1 0 0 0 0 0 0
Both_Blank 1 20.00% 0 0 0 0 0 0 0
Non_Blank_List 3 60.00% 2 2 2 2 1 1 1
Non_Blank_Type 3 60.00% 1 2 2 3 1 1 2
Both_Non_Blank 2 40.00% 1 2 2 2 1 1 1
ABC 1 33.33% 2 1 1 1 1 1 0
ABD 0 0.00% 1 2 2 2 1 1 1
BCD 0 0.00% 1 2 2 2 1 1 1
TR1 0 0.00% 1 2 2 3 1 1 1
RES 0 0.00% 1 1 1 1 1 1 0
FTT 0 0.00% 1 1 1 2 1 2 1
ZXR 0 0.00% 0 1 1 1 0 1 2
The biggest challenge is that keys come in rows and columns.
I used 2 custom functions to count the occurrences:
getcount()
, which uses a condition argument (see here if you are not familiar with quosures), for your special conditional rows (TotalID, Blank_list, ...)getcount2()
, which uses a simple character argument, for your cases rows (ABC, BCD, TR1...)Both work mostly the same way. We calculate the count separately for all single cases and for the total, always grouping by ID
and concatenate the results.
Here is the code:
library(tidyverse)
library(lubridate)
df <- structure(list(ID = c("P-10012", "P-10012", "P-10012", "P-10013", "P-10013", "P-10013",
"P-10014", "P-10014", "P-10015", "P-10016", "P-10017"),
Date = c("2020-04-15 12:13:15", "2020-04-15 12:13:15", "2020-04-15 12:13:15",
"2020-04-12 17:10:05", "2020-04-12 17:10:05", "2020-04-12 17:10:05",
"2020-04-10 04:30:19", "2020-04-10 04:30:19", "2020-04-10 14:13:15",
"2020-04-10 13:13:15", "2020-03-18 10:13:15"),
Type = c("TR1", "RES", "FTT", "TR1", "FTT", "ZXR", "TR1", "ZXR", NA, NA, "TR1"),
List = c("ABC,ABD,BCD", "ABC,ABD,BCD", "ABC,ABD,BCD", "", "", "", "ABD,BCD",
"ABD,BCD", "ABC", "", "ABC,ABD,BCD")),
class = "data.frame", row.names = c(NA, -11L))
#extract all the individual values from Type and List
cases = c(df$Type, str_split(df$List, ", ?", simplify=TRUE)) %>% unique() %>%
sort() %>% .[.!=""] %>% rlang::set_names()
#util function
is_blank = function(x) is.na(x) | x==""
#get count for summary rows (TotalID, Blank_list, ...)
getcount = function(cond){
x = map_dbl(cases, ~df %>%
filter(month(Date)==4) %>%
group_by(ID) %>%
summarise(rtn=any({{cond}} & (str_detect(Type, .x) | str_detect(List, .x)))) %>%
pull() %>% sum(na.rm=TRUE)
)
x_tot = df %>%
filter(month(Date)==4) %>%
group_by(ID) %>%
summarise(rtn=any({{cond}})) %>%
pull() %>% sum(na.rm=TRUE)
c(x_tot, x)
}
#get count for cases rows (ABC, BCD, TR1...)
getcount2 = function(key){
x = map_dbl(cases, ~df %>%
filter(month(Date)==4) %>%
group_by(ID) %>%
summarise(rtn=any(
(key %in% Type | str_detect(List, key)) &
(str_detect(Type, .x ) | str_detect(List, .x ))
)) %>%
pull() %>% sum(na.rm=TRUE)
)
x_tot = df %>%
filter(month(Date)==4) %>%
group_by(ID) %>%
summarise(rtn=any(List==key)) %>%
pull() %>% sum(na.rm=TRUE)
c(tot=x_tot, x)
}
#here we go!
tibble(x=c("Distinct_Count", cases)) %>%
mutate(
Total_ID=getcount(TRUE),
Blank_List=getcount(is_blank(List)),
Blank_Type=getcount(is_blank(Type)),
Blank_Both=getcount(is_blank(List) & is_blank(Type)),
Non_Blank_List=getcount(!is_blank(List)),
Non_Blank_Type=getcount(!is_blank(Type)),
Non_Blank_Both=getcount(!is_blank(List) & !is_blank(Type))
) %>%
bind_cols(map_dfc(cases, ~getcount2(.x))) %>%
column_to_rownames("x") %>%
t() %>% as.data.frame() %>%
mutate(Percentage = scales::percent(Distinct_Count/max(Distinct_Count)), .after="Distinct_Count")
#> Distinct_Count Percentage ABC ABD BCD FTT RES TR1 ZXR
#> Total_ID 5 100% 2 2 2 2 1 3 2
#> Blank_List 2 40% 0 0 0 1 0 1 1
#> Blank_Type 2 40% 1 0 0 0 0 0 0
#> Blank_Both 1 20% 0 0 0 0 0 0 0
#> Non_Blank_List 3 60% 2 2 2 1 1 2 1
#> Non_Blank_Type 3 60% 1 2 2 2 1 3 2
#> Non_Blank_Both 2 40% 1 2 2 1 1 2 1
#> ABC 1 20% 2 1 1 1 1 1 0
#> ABD 0 0% 1 2 2 1 1 2 1
#> BCD 0 0% 1 2 2 1 1 2 1
#> FTT 0 0% 1 1 1 2 1 2 1
#> RES 0 0% 1 1 1 1 1 1 0
#> TR1 0 0% 1 2 2 2 1 3 2
#> ZXR 0 0% 0 1 1 1 0 2 2
Created on 2021-05-12 by the reprex package (v2.0.0)
Note that there are several minor differences with your expected output, but I assume they are little mistakes from your complex example. For instance, ABC has Distinct_Count==1 so out of 5 it should not make 33%. Also, ZXR can be seen with TR1 twice (ID 13 and 14).
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With