I'm having trouble figuring out how to do the opposite of the answer to this question (and in R not python).
Count the amount of times value A occurs with value B
Basically I have a dataframe with a lot of combinations of pairs of columns like so:
df <- data.frame(id1 = c("1","1","1","1","2","2","2","3","3","4","4"),
id2 = c("2","2","3","4","1","3","4","1","4","2","1"))
I want to count, how often all the values in column A occur in the whole dataframe without the values from column B. So the results for this small example would be the output of:
df_result <- data.frame(id1 = c("1","1","1","2","2","2","3","3","4","4"),
id2 = c("2","3","4","1","3","4","1","4","2","1"),
count = c("4","5","5","3","5","4","2","3","3","3"))
The important criteria for this, is that the final results dataframe is collapsed by the pairs (so in my example rows 1 and 2 are duplicates, and they are collapsed and summed by the total frequency 1 is observed without 2). For tallying the count of occurances, it's important that both columns are examined. I.e. order of columns doesn't matter for calculating the frequency - if column A has 1 and B has 2, this counts the same as if column A has 2 and B has 1.
I can do this very slowly by filtering for each pair, but it's not really feasible for my real data where I have many many different pairs.
Any guidance is greatly appreciated.
First paste the two id columns together to id12 for later matching. Then use sapply to go through all rows to see the records where id1 appears in id12 but id2 doesn't. sum that value and only output the distinct records. Finally, remove the id12 column.
library(dplyr)
df %>% mutate(id12 = paste0(id1, id2),
count = sapply(1:nrow(.),
function(x)
sum(grepl(id1[x], id12) & !grepl(id2[x], id12)))) %>%
distinct() %>%
select(-id12)
Or in base R completely:
id12 <- paste0(df$id1, df$id2)
df$count <- sapply(1:nrow(df), function(x) sum(grepl(df$id1[x], id12) & !grepl(df$id2[x], id12)))
df <- df[!duplicated(df),]
id1 id2 count
1 1 2 4
2 1 3 5
3 1 4 5
4 2 1 3
5 2 3 5
6 2 4 4
7 3 1 2
8 3 4 3
9 4 2 3
10 4 1 3
A full tidyverse version:
library(tidyverse)
df %>%
mutate(id = paste(id1, id2),
count = map(cur_group_rows(), ~ sum(str_detect(id, id1[.x]) & str_detect(id, id2[.x], negate = T))))
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With