I have a dataset with 10 columns. The first column is an unique identifier. The 9 other columns are related attributes. For now, let's just say they are integers. If needed, the data could easily be pivoted to a key-value.
Ex:
id|attr1|attr2|attr3|...
a | 2 | 5 | 7 |...
b | 3 | 1 |null |...
c | 2 |null |null |...
d | 1 | 2 | 5 |...
e | 2 | 1 | 3 |...
I'm essentially looking for the most frequent combinations of any length with at least a pair. So my output for this would be:
unq | frequency
1,2 | 2
1,3 | 2
1,5 | 1
2,3 | 1
2,5 | 2
2,7 | 1
1,2,3 | 1
1,2,5 | 1
2,5,7 | 1
(did this manually - so hopefully there are no errors) - the order of the paring doesn't matter. 2,5,7 = 5,2,7 = 7,5,2 etc.
Any thoughts? I am open to different tools. I have access to R, excel, sql server, mysql, etc.
Excel
is preferred but not required!
Here is a solution in R:
Recreate the data
x <- data.frame(
id = letters[1:5],
attr1 = c(2,3,2,1,2),
attr2 = c(5,1,NA,2,1),
attr3 = c(7,NA,NA,5,3))
x
id attr1 attr2 attr3
1 a 2 5 7
2 b 3 1 NA
3 c 2 NA NA
4 d 1 2 5
5 e 2 1 3
Create a function to list all the combinations
make_combinations <- function(data, size){
t1 <- apply(data[, -1], 1, function(data)unname(sort(data)))
t2 <- lapply(t1, function(xt){if(length(xt)>=size){combn(xt, size)}})
t3 <- sapply(t2[!is.na(t2)],
function(chunk){if(!is.null(chunk))apply(chunk, 2, function(x)paste(x, collapse=","))})
t4 <- unlist(t3)
t4
}
Create a second function to count the combinations
count_combinations <- function(data, nn=2:3){
tmp <- unlist(lapply(nn, function(n)make_combinations(data, n)))
sort(table(tmp), decreasing=TRUE)
}
The results:
count_combinations(x, 2:3)
1,2 1,3 2,5 1,2,3 1,2,5 1,5 2,3 2,5,7 2,7 5,7
2 2 2 1 1 1 1 1 1 1
Here's your data, without the id
column.
dfr <- data.frame(
attr1 = c(2,3,2,1,2),
attr2 = c(5,1,NA,2,1),
attr3 = c(7,NA,NA,5,3)
)
This retrieves all the combinations, but the output form takes a little bit of navigating.
lapply(
seq_len(nrow(dfr)), #loop over rows
function(row)
{
lapply(
seq_along(dfr)[-1], #loop over lengths of combination, -1 is to ignore singletons
function(m)
{
combn(dfr[row, ], m)
}
)
}
)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With