Is the digest function in R suitable for anonymising participant identifiers?


Tags:

r

I often conduct research on human participants. For various reasons, my preliminary identifier is sometimes a composite of information that reduces anonymity in the data (e.g., I might concatenate a string that includes date and time of completion, IP address, and some information supplied by the participant).

Thus, if the data is to be shared in some form, a cleansed ID stripped of such information needs to be created from the preliminary ID. A simple approach in R is to assign consecutive numbers (e.g., df$id <- seq_len(nrow(df)), where df is the data.frame). However, if more data is collected during the initial phase of research or the rows are re-sorted, this can cause problems: the cleansed ID assigned to a given participant may change each time the raw dataset is updated. This in turn can break subsequent analyses of the cleansed dataset that, for example, filter cases by cleansed ID.

Thus, I thought about creating a hash using the digest function in the digest package.

df$id <- sapply(df$raw_id, digest)

This would seem to give a reliable mapping from raw identifier to cleansed identifier, while making it impossible for anyone who possesses only the cleansed identifier to recover the raw identifier.
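A minimal sketch of the stability property described above, assuming the digest package is installed. The data frame and the raw ID format are made up for illustration; the point is only that a hashed ID follows the participant, while a consecutive number follows the row position.

```r
library(digest)

# Hypothetical raw IDs (date, time, IP address, participant-supplied text)
df <- data.frame(
  raw_id = c("2013-04-01 10:15 192.0.2.1 anna",
             "2013-04-02 11:30 192.0.2.2 ben",
             "2013-04-03 09:45 192.0.2.3 cara"),
  stringsAsFactors = FALSE
)

# Assign both kinds of cleansed ID to the original row order
df$id     <- sapply(df$raw_id, digest, USE.NAMES = FALSE)
df$seq_id <- seq_len(nrow(df))

# Re-sort the raw data and assign the IDs again from scratch
df2 <- df[order(df$raw_id, decreasing = TRUE), "raw_id", drop = FALSE]
df2$id     <- sapply(df2$raw_id, digest, USE.NAMES = FALSE)
df2$seq_id <- seq_len(nrow(df2))

# Compare the two assignments participant by participant
merged <- merge(df, df2, by = "raw_id", suffixes = c("_before", "_after"))
all(merged$id_before == merged$id_after)          # TRUE: hashes are stable
all(merged$seq_id_before == merged$seq_id_after)  # FALSE: row numbers moved
```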

However, given that I am new to both the digest function and hashing in general, I wanted to ask:

Is digest suitable for stripping IDs of identifying information?
Are there any issues in using digest for this purpose?

439

asked Apr 08 '13 04:04

Jeromy Anglim

1 Answer

I have learnt many helpful things from the comments above. This answer aims to distill these comments.

There are two issues with hashing for the purpose of anonymising research participant identifiers:

Duplicate IDs: Two different raw IDs could in principle hash to the same value. This seems to be a theoretical rather than a practical issue, especially with a longer hash such as sha1 (160 bits) instead of the default md5 (128 bits). But I'm happy to be corrected on this.
Lack of anonymity: If you know the hashing algorithm, the ID format, and the exact information making up the ID, you can hash that information yourself and work out which participant it matches. In many cases this is not really an issue: the format is not shared, the participant information is not known, or the ID uses information that is virtually unknowable. Nonetheless, appending some secret text (a salt) to the ID before hashing seems to be a simple way of preventing it.
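The second issue can be illustrated with a small sketch. It assumes an attacker who knows the algorithm and the ID format and holds a short list of candidate values; the raw ID, the candidates, the salt, and the helper hash_one are all hypothetical.

```r
library(digest)

# Hash the text of a single string with sha1 (hypothetical helper)
hash_one <- function(x) digest(x, algo = "sha1", serialize = FALSE)

# Cleansed ID as it would be published without any salt
raw_id    <- "2013-04-01 192.0.2.1 anna"
published <- hash_one(raw_id)

# The attacker hashes every candidate and looks for a match
candidates <- c("2013-04-01 192.0.2.1 anna", "2013-04-01 192.0.2.1 ben")
guesses    <- vapply(candidates, hash_one, character(1), USE.NAMES = FALSE)
recovered  <- candidates[guesses == published]
recovered  # the raw ID is re-identified

# With a secret salt, the same candidate hashes no longer match
salt             <- "a-long-secret-string"  # made up; keep the real one private
published_salted <- hash_one(paste(raw_id, salt))
any(guesses == published_salted)            # FALSE
```

The salt only helps while it stays secret, so it should not be stored or shared alongside the cleansed dataset.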

Thus, to summarise the recommendations that I've gathered: hash the raw ID together with a salt, using sha1.

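library(digest)

# Hash each raw ID together with a secret salt, so that someone who knows
# only the ID format cannot reproduce the cleansed IDs.
hashed_id <- function(x, salt) {
    y <- paste(x, salt)  # append the salt to every raw ID
    # digest() is not vectorised, so hash one element at a time
    vapply(y, digest, character(1), algo = "sha1", USE.NAMES = FALSE)
}

# Keep the salt private and unchanged so IDs stay stable across updates
mydata$id <- hashed_id(mydata$raw_id, "somesalt1234")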
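A few sanity checks one might run after creating the cleansed IDs, sketched under the assumption of a small made-up mydata. The function is restated in compact form only so that the block runs on its own.

```r
library(digest)

# Compact restatement of hashed_id so this block is self-contained
hashed_id <- function(x, salt) {
  vapply(paste(x, salt), digest, character(1), algo = "sha1", USE.NAMES = FALSE)
}

# Made-up data for illustration
mydata <- data.frame(raw_id = c("p-001", "p-002", "p-003"),
                     stringsAsFactors = FALSE)
salt <- "somesalt1234"
mydata$id <- hashed_id(mydata$raw_id, salt)

# Duplicate IDs: distinct raw IDs should give distinct cleansed IDs
no_collisions <- length(unique(mydata$id)) == length(unique(mydata$raw_id))

# Stability: the same salt reproduces the same IDs in any row order
stable <- identical(hashed_id(rev(mydata$raw_id), salt), rev(mydata$id))

# Salt dependence: a different salt changes every ID
salt_matters <- !any(hashed_id(mydata$raw_id, "othersalt") == mydata$id)

c(no_collisions, stable, salt_matters)  # TRUE TRUE TRUE
```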

186

answered Sep 17 '22 15:09

Jeromy Anglim
