
How to create an example data set from private data (replacing variable names and levels with uninformative placeholders)?

Tags: r

To provide a reproducible example of an approach, a data set must often be provided. Instead of building an example data set, I wish to use some of my own data. However, this data cannot be released. I wish to replace variable (column) names and factor levels with uninformative placeholders (e.g. V1...V5, L1...L5).

Is an automated way to do this available?

Ideally, this would be done in R, taking in a data.frame and producing this anonymous data.frame.

With such a data set, simply search and replace variable names in your script and you have a publicly releasable reproducible example.

Such a process may increase the inclusion of appropriate data in reproducible examples and even the inclusion of reproducible examples in questions, comments and bug reports.

Etienne Low-Décarie asked May 04 '12 19:05



1 Answer

I don't know whether there was a function to automate this, but now there is ;)

    ## A function to anonymise columns in 'colIDs'
    ##    colIDs can be either column names or integer indices
    anonymiseColumns <- function(df, colIDs) {
        ## Resolve column names to integer positions
        ids <- if (is.character(colIDs)) match(colIDs, names(df)) else colIDs
        for (id in ids) {
            ## One random letter per column, plus the integer code of each level
            prefix <- sample(LETTERS, 1)
            suffix <- as.character(as.numeric(as.factor(df[[id]])))
            df[[id]] <- paste(prefix, suffix, sep = "")
        }
        ## Rename every anonymised column to V<position>
        names(df)[ids] <- paste("V", ids, sep = "")
        df
    }

    ## A data.frame containing sensitive information
    df <- data.frame(
        name = rep(readLines(file.path(R.home("doc"), "AUTHORS"))[9:13], each = 2),
        hiscore = runif(10, 99, 100),
        passwd = replicate(10, paste(sample(c(LETTERS, letters), 9), collapse = "")))

    ## Anonymise it
    df2 <- anonymiseColumns(df, c(1, 3))

    ## Check that it worked
    > head(df, 3)
               name  hiscore    passwd
    1 Douglas Bates 99.96714 ROELIAncz
    2 Douglas Bates 99.07243 gDOLNMyVe
    3 John Chambers 99.55322 xIVPHDuEW

    > head(df2, 3)
      V1  hiscore V3
    1 Q1 99.96714 V8
    2 Q1 99.07243 V2
    3 Q2 99.55322 V9
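The question also asks for placeholder factor levels (L1...L5) and for a way to search and replace variable names in a script afterwards. A minimal sketch of one way to do that follows. The helper name anonymiseFrame, the returned key structure, and the toy data are all assumptions made for illustration, not part of the answer above. Numeric columns are left untouched, so check separately whether their values are sensitive.

```r
## Hypothetical helper: renames every column to V1..Vn and relabels the
## levels of factor/character columns as L1..Lk. Returns the anonymised
## data plus a key mapping placeholders back to the original names.
anonymiseFrame <- function(df) {
    newNames <- sprintf("V%d", seq_along(df))
    key <- list(columns = setNames(names(df), newNames), levels = list())
    for (i in seq_along(df)) {
        col <- df[[i]]
        if (is.character(col) || is.factor(col)) {
            f <- factor(col)                       # drops unused levels
            newLevels <- sprintf("L%d", seq_along(levels(f)))
            key$levels[[newNames[i]]] <- setNames(levels(f), newLevels)
            levels(f) <- newLevels
            df[[i]] <- f
        }
    }
    names(df) <- newNames
    list(data = df, key = key)
}

## Toy data, invented for this example
private <- data.frame(
    site  = c("north", "south", "north", "east"),
    yield = c(1.2, 3.4, 2.2, 0.7),
    stringsAsFactors = FALSE)

res <- anonymiseFrame(private)
res$data          # safe to share, subject to the numeric-column caveat
res$key$columns   # placeholder -> original column name; keep this private
```

The key is what makes the search-and-replace step mechanical: each placeholder in res$key$columns maps to the original variable name used in the script. Only res$data should be published.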
Josh O'Brien answered Sep 18 '22 04:09