Find variable combinations that makes Primary Key in R

Question

Here is my toy dataframe.

df <- tibble::tribble(
  ~var1, ~var2, ~var3, ~var4, ~var5, ~var6, ~var7,
    "A",   "C",    1L,    5L,  "AA",  "AB",    1L,
    "A",   "C",    2L,    5L,  "BB",  "AC",    2L,
    "A",   "D",    1L,    7L,  "AA",  "BC",    2L,
    "A",   "D",    2L,    3L,  "BB",  "CC",    1L,
    "B",   "C",    1L,    8L,  "AA",  "AB",    1L,
    "B",   "C",    2L,    6L,  "BB",  "AC",    2L,
    "B",   "D",    1L,    9L,  "AA",  "BC",    2L,
    "B",   "D",    2L,    6L,  "BB",  "CC",    1L)

How can I get the combination of a minimum number of variables that uniquely identify the observations in the dataframe i.e which variables together can make the primary key?

The way I approached this problem is to find the combination of variables for which distinct values is equal to the number of observations of the data frame. So, those variable combinations that will give me 8 observation, in this case. I randomly tried that and found few:

df %>% distinct(var1, var2, var3)

df %>% distinct(var1, var2, var5)

df %>% distinct(var1, var3, var7)

So vars123, vars125, vars137 deserves to the Primary Key here. How can I find these variable combinations programmatically using R. Also, more preference should be given to character, factor, date, and (maybe) integer variables, if possible, as doubles should not make the Primary Key.

The output could be list or dataframe stating combinations "var1, var2, var3", "var1, var2, var5", "var1, var3, var7".

thelatemail · Accepted Answer

A bit of a variation on the other answers, but here's the requested tabular output:

nms <- unlist(lapply(seq_len(length(df)), combn, x=names(df), simplify=FALSE), rec=FALSE)
out <- data.frame(
  vars = vapply(nms, paste, collapse=",", FUN.VALUE=character(1)),
  counts = vapply(nms, function(x) nrow(unique(df[x])), FUN.VALUE=numeric(1))
)

Then take the least number of variables required to be a primary key:

out[match(nrow(df), out$counts),]
#        vars counts
#12 var1,var6      8

IceCreamToucan · Answer

There may be a better way, but here's a brute-force method

combs <- lapply(seq(ncol(df)), function(x) combn(names(df), x, simplify = F))

keys <- list()
for(i in seq_along(combs)){
  keys[[i]] <- combs[[i]][sapply(combs[[i]], function(x) nrow(distinct(df[x])) == nrow(df))]
  if(length(keys[[i]])) stop(paste('Found key of', i, 'columns, stopping'))
}


keys

# [[1]]
# list()
# 
# [[2]]
# [[2]][[1]]
# [1] "var1" "var6"
# 
# [[2]][[2]]
# [1] "var4" "var6"
# 
# [[2]][[3]]
# [1] "var4" "var7"

Find variable combinations that makes Primary Key in R

Tags:

r

data.table

tidyverse

Geet

2 Answers

thelatemail

IceCreamToucan

Recent Activity

Donate For Us

Find variable combinations that makes Primary Key in R

Tags:

r

data.table

tidyverse

Geet

2 Answers

thelatemail

IceCreamToucan

Related questions

Recent Activity

Donate For Us