
Find variable combinations that make a primary key in R

Here is my toy dataframe.

df <- tibble::tribble(
  ~var1, ~var2, ~var3, ~var4, ~var5, ~var6, ~var7,
    "A",   "C",    1L,    5L,  "AA",  "AB",    1L,
    "A",   "C",    2L,    5L,  "BB",  "AC",    2L,
    "A",   "D",    1L,    7L,  "AA",  "BC",    2L,
    "A",   "D",    2L,    3L,  "BB",  "CC",    1L,
    "B",   "C",    1L,    8L,  "AA",  "AB",    1L,
    "B",   "C",    2L,    6L,  "BB",  "AC",    2L,
    "B",   "D",    1L,    9L,  "AA",  "BC",    2L,
    "B",   "D",    2L,    6L,  "BB",  "CC",    1L)

How can I find the smallest combination of variables that uniquely identifies the observations in the data frame, i.e., which variables together can form a primary key?

My approach is to find the combinations of variables whose number of distinct rows equals the number of observations in the data frame; in this case, the combinations that give me 8 rows. I tried a few at random and found some:

df %>% distinct(var1, var2, var3)

df %>% distinct(var1, var2, var5)

df %>% distinct(var1, var3, var7)

So (var1, var2, var3), (var1, var2, var5), and (var1, var3, var7) each qualify as a primary key here. How can I find these variable combinations programmatically in R? Also, if possible, preference should be given to character, factor, date, and (maybe) integer variables, since doubles should not make up a primary key.

The output could be a list or a data frame stating the combinations "var1, var2, var3", "var1, var2, var5", "var1, var3, var7".

Geet asked Nov 01 '18 20:11


2 Answers

A bit of a variation on the other answers, but here's the requested tabular output:

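# Every combination of column names, from single columns up to all columns
nms <- unlist(lapply(seq_along(df), combn, x = names(df), simplify = FALSE), recursive = FALSE)
# One row per combination: its label and the number of distinct rows it yields
out <- data.frame(
  vars = vapply(nms, paste, collapse = ",", FUN.VALUE = character(1)),
  counts = vapply(nms, function(x) nrow(unique(df[x])), FUN.VALUE = numeric(1))
)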

Then take the first combination whose distinct-row count equals nrow(df). The combinations are generated smallest first, so this is a key with the fewest variables:

out[match(nrow(df), out$counts),]
#        vars counts
#12 var1,var6      8
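
The lookup above returns only the first matching combination. As a minimal sketch (an extension, not part of the original answer), the same table can list every key of the smallest size. It rebuilds the toy data in base R so it runs on its own; the `n_vars` column and the `keys` / `min_keys` names are introduced here purely for illustration.

```r
# Toy data from the question, rebuilt with base R so the sketch is self-contained.
df <- data.frame(
  var1 = c("A", "A", "A", "A", "B", "B", "B", "B"),
  var2 = c("C", "C", "D", "D", "C", "C", "D", "D"),
  var3 = c(1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L),
  var4 = c(5L, 5L, 7L, 3L, 8L, 6L, 9L, 6L),
  var5 = c("AA", "BB", "AA", "BB", "AA", "BB", "AA", "BB"),
  var6 = c("AB", "AC", "BC", "CC", "AB", "AC", "BC", "CC"),
  var7 = c(1L, 2L, 2L, 1L, 1L, 2L, 2L, 1L),
  stringsAsFactors = FALSE
)

# Every column combination, smallest first.
nms <- unlist(lapply(seq_along(df), combn, x = names(df), simplify = FALSE),
              recursive = FALSE)

out <- data.frame(
  vars   = vapply(nms, paste, character(1), collapse = ","),
  n_vars = lengths(nms),  # illustrative column: size of each combination
  counts = vapply(nms, function(x) nrow(unique(df[x])), integer(1)),
  stringsAsFactors = FALSE
)

# Combinations that identify every row, then only those of the smallest size.
keys <- out[out$counts == nrow(df), ]
min_keys <- keys[keys$n_vars == min(keys$n_vars), ]
min_keys$vars
# [1] "var1,var6" "var4,var6" "var4,var7"
```

Dropping the `min(keys$n_vars)` filter would instead return every key, including the three-variable ones mentioned in the question.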
thelatemail answered Oct 17 '22 04:10


There may be a better way, but here's a brute-force method:

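library(dplyr)  # for distinct()
combs <- lapply(seq_along(df), function(x) combn(names(df), x, simplify = FALSE))

keys <- list()
for (i in seq_along(combs)) {
  # Keep combinations of i columns whose distinct rows number nrow(df)
  is_key <- sapply(combs[[i]], function(x) nrow(distinct(df[x])) == nrow(df))
  keys[[i]] <- combs[[i]][is_key]
  if (length(keys[[i]])) {
    message("Found key of ", i, " columns, stopping")
    break  # leave the loop at the smallest key size without raising an error
  }
}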


keys

# [[1]]
# list()
# 
# [[2]]
# [[2]][[1]]
# [1] "var1" "var6"
# 
# [[2]][[2]]
# [1] "var4" "var6"
# 
# [[2]][[3]]
# [1] "var4" "var7"
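
The question also asks that doubles not take part in a key. One way to sketch that, under stated assumptions, is to restrict the candidate columns before running the same brute-force search. The helper `find_min_keys()` and the extra double column `score` are hypothetical, added only to show the filter at work; the rest is the question's toy data rebuilt in base R.

```r
# Toy data from the question (base R), plus a hypothetical double column
# `score` that is unique per row and exists only to demonstrate the filter.
df <- data.frame(
  var1 = c("A", "A", "A", "A", "B", "B", "B", "B"),
  var2 = c("C", "C", "D", "D", "C", "C", "D", "D"),
  var3 = c(1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L),
  var4 = c(5L, 5L, 7L, 3L, 8L, 6L, 9L, 6L),
  var5 = c("AA", "BB", "AA", "BB", "AA", "BB", "AA", "BB"),
  var6 = c("AB", "AC", "BC", "CC", "AB", "AC", "BC", "CC"),
  var7 = c(1L, 2L, 2L, 1L, 1L, 2L, 2L, 1L),
  score = seq(0.5, 4, by = 0.5),
  stringsAsFactors = FALSE
)

# Hypothetical helper: the smallest column sets drawn from `candidates`
# that uniquely identify every row of `data`.
find_min_keys <- function(data, candidates = names(data)) {
  for (m in seq_along(candidates)) {
    combs <- combn(candidates, m, simplify = FALSE)
    # anyDuplicated() returns 0 when no row repeats, so !0 marks a key
    is_key <- vapply(combs, function(cols) !anyDuplicated(data[cols]), logical(1))
    if (any(is_key)) return(combs[is_key])
  }
  list()  # no key exists, e.g. the data contains fully duplicated rows
}

# Exclude double columns; character, factor, Date and integer columns remain.
candidates <- names(df)[!vapply(df, is.double, logical(1))]
min_keys <- find_min_keys(df, candidates)
```

Without the filter, `find_min_keys(df)` would settle on `score` alone, which is exactly the kind of key the question wants to avoid. Note that `is.double()` is also TRUE for Date columns stored as doubles, so a stricter type test may be needed for real data.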
IceCreamToucan answered Oct 17 '22 04:10