Here is my toy dataframe.
df <- tibble::tribble(
~var1, ~var2, ~var3, ~var4, ~var5, ~var6, ~var7,
"A", "C", 1L, 5L, "AA", "AB", 1L,
"A", "C", 2L, 5L, "BB", "AC", 2L,
"A", "D", 1L, 7L, "AA", "BC", 2L,
"A", "D", 2L, 3L, "BB", "CC", 1L,
"B", "C", 1L, 8L, "AA", "AB", 1L,
"B", "C", 2L, 6L, "BB", "AC", 2L,
"B", "D", 1L, 9L, "AA", "BC", 2L,
"B", "D", 2L, 6L, "BB", "CC", 1L)
How can I find the smallest combination of variables that uniquely identifies the observations in the dataframe, i.e. which variables together can form a primary key?
My approach is to find combinations of variables for which the number of distinct rows equals the number of observations in the data frame, i.e. combinations that give me 8 rows in this case. I tried a few at random and found some that work:
df %>% distinct(var1, var2, var3)
df %>% distinct(var1, var2, var5)
df %>% distinct(var1, var3, var7)
So (var1, var2, var3), (var1, var2, var5), and (var1, var3, var7) each qualify as a primary key here. How can I find such variable combinations programmatically in R? Also, if possible, preference should be given to character, factor, date, and (maybe) integer variables, as doubles should not be part of a primary key.
The output could be a list or data frame listing the combinations "var1, var2, var3", "var1, var2, var5", "var1, var3, var7".
A bit of a variation on the other answers, but here's the requested tabular output:
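# All column-name combinations, from single columns up to all columns
nms <- unlist(lapply(seq_along(df), combn, x = names(df), simplify = FALSE), recursive = FALSE)
out <- data.frame(
  vars = vapply(nms, paste, collapse = ",", FUN.VALUE = character(1)),
  # Number of distinct rows when only these columns are kept
  counts = vapply(nms, function(x) nrow(unique(df[x])), FUN.VALUE = numeric(1))
)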
Then take the first row whose count equals the number of rows in df. Because the combinations are ordered by size, this is a primary key with the fewest variables:
out[match(nrow(df), out$counts),]
# vars counts
#12 var1,var6 8
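Note that match() returns only the first hit. If every smallest key is wanted in the table, one option is to record the size of each combination and filter on it. This is only a sketch: the `size` column and the `min_keys` name are introduced here for illustration, and the toy data is rebuilt with base data.frame() so the block runs on its own.

```r
# Toy data from the question, rebuilt in base R
df <- data.frame(
  var1 = rep(c("A", "B"), each = 4),
  var2 = rep(c("C", "C", "D", "D"), 2),
  var3 = rep(1:2, 4),
  var4 = c(5L, 5L, 7L, 3L, 8L, 6L, 9L, 6L),
  var5 = rep(c("AA", "BB"), 4),
  var6 = rep(c("AB", "AC", "BC", "CC"), 2),
  var7 = rep(c(1L, 2L, 2L, 1L), 2)
)

# All column-name combinations, ordered by size
nms <- unlist(lapply(seq_along(df), combn, x = names(df), simplify = FALSE),
              recursive = FALSE)

out <- data.frame(
  vars   = vapply(nms, paste, collapse = ",", FUN.VALUE = character(1)),
  size   = lengths(nms),  # number of columns in each combination (assumed extra column)
  counts = vapply(nms, function(x) nrow(unique(df[x])), FUN.VALUE = numeric(1))
)

# Keep every combination that identifies all rows, then only the smallest ones
keys     <- out[out$counts == nrow(df), ]
min_keys <- keys[keys$size == min(keys$size), ]
min_keys$vars
# "var1,var6" "var4,var6" "var4,var7"
```

Dropping the last filter (using `keys` instead of `min_keys`) would also list larger keys such as var1,var2,var3 from the question.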
There may be a better way, but here's a brute-force method:
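library(dplyr)  # provides distinct()
# All column-name combinations, grouped by size
combs <- lapply(seq(ncol(df)), function(x) combn(names(df), x, simplify = FALSE))
keys <- list()
for (i in seq_along(combs)) {
  # Keep combinations with as many distinct rows as df has rows
  keys[[i]] <- combs[[i]][sapply(combs[[i]], function(x) nrow(distinct(df[x])) == nrow(df))]
  if (length(keys[[i]])) {
    message('Found key of ', i, ' columns, stopping')
    break  # leave the loop without raising an error
  }
}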
keys
# [[1]]
# list()
#
# [[2]]
# [[2]][[1]]
# [1] "var1" "var6"
#
# [[2]][[2]]
# [1] "var4" "var6"
#
# [[2]][[3]]
# [1] "var4" "var7"
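The question also asks that doubles be kept out of the key, which the search above does not handle. One way to approach it, sketched here in base R, is to drop double columns from the candidates before searching. `find_min_keys` is a hypothetical helper name, and the extra `dbl` column is made up purely to show the filter at work.

```r
# Hypothetical helper: return all smallest column sets that uniquely identify
# the rows of `data`, never using plain double columns.
find_min_keys <- function(data) {
  # A column counts as "double" if it is numeric but not integer.
  # is.numeric() is FALSE for characters, factors and Dates, so those stay eligible.
  is_dbl <- vapply(data, function(col) is.numeric(col) && !is.integer(col), logical(1))
  candidates <- names(data)[!is_dbl]

  for (k in seq_along(candidates)) {
    combos <- combn(candidates, k, simplify = FALSE)
    # A combination is a key when no row is duplicated on those columns
    is_key <- vapply(combos, function(cols) anyDuplicated(data[cols]) == 0L, logical(1))
    if (any(is_key)) {
      return(vapply(combos[is_key], paste, character(1), collapse = ","))
    }
  }
  character(0)  # no key found among the eligible columns
}

# Toy data from the question, plus an assumed double column that is unique per row
df <- data.frame(
  var1 = rep(c("A", "B"), each = 4),
  var2 = rep(c("C", "C", "D", "D"), 2),
  var3 = rep(1:2, 4),
  var4 = c(5L, 5L, 7L, 3L, 8L, 6L, 9L, 6L),
  var5 = rep(c("AA", "BB"), 4),
  var6 = rep(c("AB", "AC", "BC", "CC"), 2),
  var7 = rep(c(1L, 2L, 2L, 1L), 2),
  dbl  = seq(0.5, 4, by = 0.5)
)

find_min_keys(df)
# "var1,var6" "var4,var6" "var4,var7"
```

Without the type filter, `dbl` alone would be reported as a one-column key. Like the loop above, this is still brute force, so the number of combinations grows quickly with the number of columns.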