I have a data frame (df) or data table (dt) with, let’s say 1000 variables and 1000 observations. I checked that there are no duplicates in the observations, so dt[!duplicated(dt)] has the same length as the original file.
I would like to create an ID variable for all this observation with a combination of some of the 1000 variables I have. Differently to other SO questions as I don’t know which variables are more suitable to create the ID and it is likely that I need a combination of, at least, 3 or 4 variables.
Is there any package/function in R that could get me the most efficient combination of variables to create an ID variable? In my real example I am struggling to create an ID manually, and probably it is not the best combination of variables.
Example with mtcars:
require(data.table)
example <- data.table(mtcars)
rownames(example) <- NULL # Delete mtcars row names
example <- example[!duplicated(example),]
example[,id_var_wrong := paste0(mpg,"_",cyl)]
length(unique(example$id_var_wrong)) # Wrong ID, there are only 27 different values for this variable despite 32 observations
example[,id_var_good := paste0(wt,"_",qsec)]
length(unique(example$id_var_good)) # Good ID as there are equal number of unique values as different observations.
Is there any function to find wt and qsec automatically and not manually?
A homemade algorithm: the principle is to greedily take the variable with the most distinct number of elements and then to filter only the remaining rows with duplicates and to iterate. This doesn't give the best solution but it's an easy way to get a rather good solution quickly.
set.seed(1)
mat <- replicate(1000, sample(c(letters, LETTERS), size = 100, replace = TRUE))
library(dplyr)
columnsID <- function(mat) {
df <- df0 <- as_data_frame(mat)
vars <- c()
while(nrow(df) > 0) {
var_best <- names(which.max(lapply(df, n_distinct)))[[1]]
vars <- append(vars, var_best)
df <- group_by_at(df0, vars) %>% filter(n() > 1)
}
vars
}
columnsID(mat)
[1] "V68" "V32"
In many cases there is a natural key which uniquely identifies each observation. For instance, the mtcars data set has unique row names.
library(data.table)
data.table(mtcars, keep.rownames = "id")
id mpg cyl disp hp drat wt qsec vs am gear carb 1: Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 2: Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 3: Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 4: Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 5: Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 6: Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 ...
If there is no natural key available, I suggest to create an articifial key by simply numbering the rows consecutively and storing it in an additional column:
data.table(mtcars)[, rn := .I][]
mpg cyl disp hp drat wt qsec vs am gear carb rn 1: 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 1 2: 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 2 3: 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 3 4: 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 4 5: 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 5 6: 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 6 ...
Anything else might not be worth the effort, or is at risk that attribute values may become identical, e.g., when they are rounded.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With