
technique to obfuscate clustered data and preserve privacy in r

background

i have some private survey data that contains a column of confidential information: the geographic location of the survey respondents. under no circumstances can this information be released.

as is common in survey research, in order for users to correctly calculate a variance on my survey data set, those users will either need that geographic location (unacceptable) or, alternatively, a set of replicate weights. i can create that set of replicate weights; however, it's quite easy to look at the correlations between those weights and back-calculate which of the survey respondents share the same geographic location. that is also unacceptable.

to help me with this question, you don't have to be familiar with replicate weights -- just think of them as a few columns of strongly-correlated clustered data.

i understand that if i want to maintain that clustering, an evil data user will always have semi-decent guesses at who shares geographic locations; i just want to make that guessing game less precise. on the un-obfuscated replicate weights, an evil data user can figure out 100% of the cases.
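to make the leak concrete, here is a toy sketch in r. everything in it is an assumption for illustration: the data are simulated, the names `cluster`, `base_w` and `repw` are hypothetical, and the replicate weights are built as a delete-one-cluster jackknife, which may not match the scheme used on the real survey. it shows how a user who sees only the weights can recover which respondents share a location.

```r
set.seed(1)

# secret: 20 respondents spread over 5 locations (never released)
n_clusters <- 5
cluster <- sample(rep(1:n_clusters, each = 4))

# released: a base weight and one replicate weight column per cluster
base_w <- runif(length(cluster), min = 50, max = 150)
repw <- sapply(1:n_clusters, function(k) {
  # delete-one-cluster jackknife: zero out cluster k, scale up the rest
  ifelse(cluster == k, 0, base_w * n_clusters / (n_clusters - 1))
})

# attacker side: only base_w and repw are used from here on
ratio <- repw / base_w              # each row is one respondent's adjustment pattern
same_guess <- cor(t(ratio)) > 0.999 # respondents with matching patterns

# compare the guess with the secret truth
same_truth <- outer(cluster, cluster, "==")
mean(same_guess == same_truth)      # share of respondent pairs classified correctly
```

in this setup every pair is classified correctly, which is the "100% of the cases" problem described above.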

request

i am looking for a technique that

  • prevents public use file users from easily deducing shared geographic location from the correlations between my replicate weights variables
  • does not obliterate the correlations between my columns of data (the replicate weights variables)
  • can be implemented on an R data.frame object without a major time investment

i say shared because the evil user might not know where the location is, but they might know if two survey respondents are from the same location -- an unacceptable possibility.

what i have tried

i don't really want to re-invent the wheel here. i am looking for r syntax, an r package, or anything else that would be relatively straightforward to implement. i've found one, two, three, four papers describing techniques that would all be suitable for my purposes; unfortunately, none of the authors have been willing to share actual code to implement them.

i can do simple things like add and subtract random values to my replicate weights columns according to a normal distribution, but i'd prefer to rely on the work of someone who understands privacy issues better than i do.
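for reference, the simple approach mentioned above might look like the sketch below. it is only the naive baseline, not a vetted disclosure-control method, and it makes no claim about how much privacy it buys. the toy data, the helper name `add_noise` and the `rel_sd` parameter are all hypothetical.

```r
set.seed(42)

# toy stand-in for strongly correlated replicate weight columns
n <- 200
shared <- rnorm(n, mean = 100, sd = 20)
repw <- data.frame(
  repw1 = shared + rnorm(n, sd = 2),
  repw2 = shared + rnorm(n, sd = 2),
  repw3 = shared + rnorm(n, sd = 2)
)

# add independent normal noise to every column of a data.frame;
# rel_sd sets the noise sd as a fraction of each column's own sd
add_noise <- function(df, rel_sd = 0.1) {
  df[] <- lapply(df, function(col) {
    col + rnorm(length(col), mean = 0, sd = rel_sd * sd(col))
  })
  df
}

noisy <- add_noise(repw)

# compare the correlation structure before and after perturbation
round(cor(repw), 3)
round(cor(noisy), 3)
```

raising `rel_sd` blurs the link between respondents more, but it also weakens the correlations between columns, so the two goals in the request pull against each other and the value would need tuning against real variance estimates.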

thanks!!!!

Anthony Damico asked Jun 13 '14 09:06


1 Answer

i have written this nine-step tutorial to walk through the process in an attempt to answer my own question. i am not an expert in the field of privacy/confidentiality and would love to hear both feedback about this idea and also other ideas. thanks!

http://www.asdfree.com/2014/09/how-to-provide-variance-calculation-on.html

Anthony Damico answered Nov 07 '22 01:11