
Is digest function in R suitable for anonymising participant identifiers?

Tags: r

I often conduct research on human participants. For various reasons my preliminary identifier is sometimes a composite of information that reduces anonymity in the data (e.g., I might concatenate a string that includes date and time of completion, IP address, and some information supplied by the participant).

Thus, if the data is to be shared in some form, a cleansed ID, stripped of such information, needs to be created from the preliminary ID. A simple approach in R is to assign consecutive numbers (e.g., df$id <- seq(nrow(df)), where df is the data.frame). However, if more data is collected in the initial phase of research or the rows are re-sorted, this can cause problems: the cleansed ID assigned to a given participant may change each time the raw dataset is updated. This in turn can break subsequent analyses on the cleansed dataset that, for example, filter cases based on cleansed ID.

Thus, I thought about creating a hash using the digest function in the digest package.

df$id <- sapply(df$raw_id, digest)

This would seem to give a reliable mapping from raw identifier to cleansed identifier, while making it impossible for anyone who possesses only the cleansed identifier to recover the raw identifier.
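To make the contrast concrete, here is a minimal sketch comparing sequential IDs with hashed IDs when rows are reordered. The raw identifiers and their format are made up for illustration, and digest is called with its default algorithm:

```r
library(digest)

# Hypothetical raw identifiers (illustrative only)
df <- data.frame(
  raw_id = c("2013-04-08 10:15|10.0.0.1|alice",
             "2013-04-08 10:20|10.0.0.2|bob",
             "2013-04-08 10:31|10.0.0.3|carol"),
  stringsAsFactors = FALSE
)

# Sequential IDs depend on row position
df$seq_id <- seq(nrow(df))
# Hashed IDs depend only on the raw identifier itself
df$hash_id <- vapply(df$raw_id, digest, character(1), USE.NAMES = FALSE)

# Simulate the raw data arriving in a different row order
df2 <- df[c(3, 1, 2), "raw_id", drop = FALSE]
df2$seq_id <- seq(nrow(df2))
df2$hash_id <- vapply(df2$raw_id, digest, character(1), USE.NAMES = FALSE)

# The third participant's sequential ID changes from 3 to 1 ...
c(before = df$seq_id[3], after = df2$seq_id[1])
# ... but the hashed ID is the same in both versions
identical(df$hash_id[3], df2$hash_id[1])  # TRUE
```

The hashed ID stays attached to the participant regardless of row order or of how many rows are added later.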

However, given that I am new to both the digest function and hashing in general, I wanted to ask:

  • Is digest suitable for stripping IDs of identifying information?
  • Are there any issues in using digest for this purpose?
Jeromy Anglim asked Apr 08 '13 04:04


1 Answer

I have learnt many helpful things from the comments above. This answer aims to distill these comments.

There are two issues with hashing for the purpose of anonymising research participant identifiers:

  • Duplicate IDs: Two different raw IDs could in principle hash to the same value (a collision). This seems to be a theoretical rather than a practical issue (possibly especially if you use sha1). But I'm happy to be corrected on this.
  • Lack of anonymity: If you know the hashing algorithm, the ID format, and the exact information making up a participant's ID, you can hash that information yourself and work out which participant matches it. In many cases, where the format is not shared, the participant information is not known, or the ID uses information that is virtually unknowable, this is really not an issue. Nonetheless, adding some secret text (a salt) to the ID before hashing seems to be a simple way of preventing this from being an issue.
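The second issue can be illustrated with a small sketch. It assumes, purely for illustration, that the raw ID is an email address and that the attacker knows both this format and the algorithm; sha1_id is a hypothetical helper, not part of the digest package:

```r
library(digest)

# Hypothetical helper: one SHA-1 hash per element of x
sha1_id <- function(x) {
  vapply(x, function(v) digest(v, algo = "sha1"), character(1),
         USE.NAMES = FALSE)
}

# Shared "anonymised" data built from unsalted hashes (made-up raw IDs)
raw_ids <- c("alice@example.org", "bob@example.org", "carol@example.org")
shared <- data.frame(id = sha1_id(raw_ids), score = c(12, 17, 9))

# An attacker hashes candidate values and looks for matches
candidates <- c("dave@example.org", "bob@example.org")
guessed <- sha1_id(candidates)
candidates[guessed %in% shared$id]  # "bob@example.org" is re-identified

# With secret text appended before hashing, the same guesses fail
salted <- sha1_id(paste(raw_ids, "somesalt1234"))
any(guessed %in% salted)            # FALSE
```

The protection holds only as long as the salt stays private; anyone who learns it can repeat the same guessing attack.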

Thus, to summarise the recommendations that I've gathered:

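library(digest)

# Returns one SHA-1 hash per element of x, computed on "<x> <salt>"
hashed_id <- function(x, salt) {
    y <- paste(x, salt)  # append the secret salt to each raw ID
    # vapply guarantees an unnamed character vector, one hash per ID
    vapply(y, function(X) digest(X, algo = "sha1"), character(1), USE.NAMES = FALSE)
}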

mydata$id <- hashed_id(mydata$raw_id, "somesalt1234")
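Since the salt only helps if it stays secret, one option is to keep it out of any script that is shared alongside the data. The following is a sketch under assumptions: the environment variable name ID_SALT and the example raw IDs are hypothetical, and the fallback value exists only so the demo runs. It also adds a basic check that hashing introduced no duplicate IDs:

```r
library(digest)

hashed_id <- function(x, salt) {
  y <- paste(x, salt)
  vapply(y, function(v) digest(v, algo = "sha1"), character(1),
         USE.NAMES = FALSE)
}

# Made-up raw identifiers for illustration
mydata <- data.frame(raw_id = c("p001", "p002", "p003"),
                     stringsAsFactors = FALSE)

# Read the salt from an environment variable (hypothetical name ID_SALT);
# the fallback is for this demo only and should not be used on real data
salt <- Sys.getenv("ID_SALT", unset = "somesalt1234")

mydata$id <- hashed_id(mydata$raw_id, salt)

# Sanity check: as many distinct hashed IDs as distinct raw IDs
stopifnot(length(unique(mydata$id)) == length(unique(mydata$raw_id)))

# Drop the raw identifier before sharing
shareable <- mydata[, "id", drop = FALSE]
```

The same salt has to be reused every time the raw data is reprocessed, otherwise the cleansed IDs will change between versions.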
Jeromy Anglim answered Sep 17 '22 15:09