have a simple dataframe listing types of products and tag number per a couple of months:
data.frame(
stringsAsFactors = FALSE,
month = c("jan","jan","jan","jan",
"jan","feb","feb","feb","feb"),
category = c("TB", "GT", "TB", "YT", "GT", "TB", "GT", "TB", "YT"),
tag_number = c(101L, 101L, 223L, 223L, 223L, 345L, 345L, 655L, 223L)
)
month category tag_number
jan TB 101
jan GT 101
jan TB 223
jan YT 223
jan GT 223
feb TB 345
feb GT 345
feb TB 655
feb YT 223
I want to compare and extract the percentual difference between the unique tag_number values by month and category.
Let me explain a bit more as this seems a complicated issue.
If we get grouped month and categories we get to compare this table (month+category)
Jan TB 101, 223 vs. Feb TB 345, 655
Jan GT 101, 223 vs. Feb GT 345
Jan YT 223 vs. Feb YT 223
In January the TB category has two unique tag numbers (101 and 223). If you compare with February there are two unique tags too, but none of them are equal so the percent difference between months is 100% and the count distinct is two in both cases.
Same for GT category. All tags are different. So again 100%.
Different case is YT. Both months contained the same tag number so the difference is 0%
Here expected result. The percent diff is the percentage of cases distinct between months.
Let's take TB category.
4 unique values total, 2 unique per month, none of them equal so:
4 / 4 = 1 (so 100%)
category pct_diff
TB 100%
GT 100%
YT 0%
df %>%
group_by(category) %>%
summarise(perc_diff = 100 * mean(table(tag_number) == 1))
# A tibble: 3 x 2
category perc_diff
<chr> <dbl>
1 GT 100
2 TB 100
3 YT 0
Try this solution, but I am not sure whether it can be generalized to a larger case. The approach is to create a wide table then calculate difference percentage.
library(dplyr)
library(tidyr)
library(purrr)
df <- data.frame(
stringsAsFactors = FALSE,
month = c("jan","jan","jan","jan",
"jan","feb","feb","feb","feb"),
category = c("TB", "GT", "TB", "YT", "GT", "TB", "GT", "TB", "YT"),
tag_number = c(101L, 101L, 223L, 223L, 223L, 345L, 345L, 655L, 223L)
)
df2 <- df %>%
pivot_wider(names_from = "month", values_from = "tag_number") %>%
mutate(both_months = map2(jan, feb, ~intersect(.x,.y))) %>% #find intersect
mutate(all_lenght = map2(jan,feb, ~union(.x,.y))) %>% #find all length
mutate(percent_diff = map2(both_months, all_lenght,
~(100 - length(.x)*100/length(.y))) %>%
unlist()) # calculate difference percentage
df2
#> # A tibble: 3 x 6
#> category jan feb both_months all_lenght percent_diff
#> <chr> <list> <list> <list> <list> <dbl>
#> 1 TB <int [2]> <int [2]> <int [0]> <int [4]> 100
#> 2 GT <int [2]> <int [1]> <int [0]> <int [3]> 100
#> 3 YT <int [1]> <int [1]> <int [1]> <int [1]> 0
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With