technique to obfuscate clustered data and preserve privacy in r

Question

background

i have some private survey data that contains a column of confidential information: the geographic location of the survey respondents. under no circumstances can this information be released.

as is common in survey research, in order for users to correctly calculate a variance on my survey data set, those users will either need that geographic location (unacceptable) or, alternatively, a set of replicate weights. i can create that set of replicate weights; however, it's quite easy to look at the correlations between those weights and back-calculate which of the survey respondents share the same geographic location. that is also unacceptable.

to help me with this question, you don't have to be familiar with replicate weights -- just think of them as a few columns of strongly-correlated clustered data.

i understand that if i want to maintain that clustering, an evil data user will always have semi-decent guesses at who shares geographic locations; i just want to make that guessing game less precise. on the un-obfuscated replicate weights, an evil data user can figure out 100% of the cases.
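to make the threat concrete, here is a minimal sketch on made-up data. everything in it is an assumption for illustration (the toy `cluster` column, the uniform adjustment factors, the `repw` column names), not my real file and not how any particular replication method builds its weights. it shows that when respondents in one location share the same per-replicate adjustment, dividing out the main weight and grouping identical rows recovers every shared location.

```r
set.seed(1)

# toy data: 12 respondents in 4 hypothetical locations (the confidential column)
n <- 12
cluster <- rep(1:4, each = 3)
base_weight <- runif(n, 50, 150)

# 8 toy replicate weights: every respondent in a location shares the same
# per-replicate adjustment factor, which is what makes the columns clustered
n_reps <- 8
factors <- matrix(runif(4 * n_reps, 0.5, 1.5), nrow = 4)
rep_weights <- base_weight * factors[cluster, ]
colnames(rep_weights) <- paste0("repw", seq_len(n_reps))

# the attack: divide out the main weight, then group rows that are identical
ratios <- round(rep_weights / base_weight, 6)
signature <- apply(ratios, 1, paste, collapse = "|")
guessed_cluster <- as.integer(factor(signature, levels = unique(signature)))

# cross-tab of true vs. guessed location: each true location maps to one guess
table(cluster, guessed_cluster)
```

the evil user never learns where a location is, only which rows belong together, and that is exactly the disclosure i need to prevent.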

request

i am looking for a technique that

prevents the public use file users from easily deducing the shared geographic location from the correlations between my replicate weights variables
does not obliterate the correlations between my columns of data (the replicate weights variables)
can be implemented on an R data.frame object without a major time investment

i say shared because the evil user might not know where the location is, but they might know if two survey respondents are from the same location -- an unacceptable possibility.

what i have tried

i don't really want to re-invent the wheel here. i am looking for r syntax, an r package, or anything else that would be relatively straightforward to implement. i've found four papers describing techniques that would all be suitable for my purposes; unfortunately, none of the authors have been willing to share actual code to implement them.

i can do simple things like add and subtract random values to my replicate weights columns according to a normal distribution, but i'd prefer to rely on the work of someone who understands privacy issues better than i do.
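for reference, the simple approach i mean looks roughly like the sketch below. it is only the naive baseline, not a vetted disclosure-control method: the toy data, the multiplicative form of the noise, and the `noise_sd` value are all assumptions picked for illustration.

```r
set.seed(2)

# toy stand-in for the replicate weights columns (hypothetical names and values)
cluster <- rep(1:4, each = 3)
base_weight <- runif(12, 50, 150)
factors <- matrix(runif(4 * 8, 0.5, 1.5), nrow = 4)
x <- base_weight * factors[cluster, ]
colnames(x) <- paste0("repw", 1:8)

# perturb every replicate weight with multiplicative normal noise;
# noise_sd is an arbitrary tuning knob, not a recommended value
noise_sd <- 0.05
noise <- matrix(rnorm(length(x), mean = 1, sd = noise_sd), nrow = nrow(x))
x_noisy <- x * noise

# exact row matching on the weight ratios no longer finds shared locations
ratios <- round(x_noisy / base_weight, 6)
any_exact_match <- any(duplicated(ratios))
any_exact_match

# each noisy column is still strongly correlated with its original
col_cor <- diag(cor(x, x_noisy))
round(col_cor, 3)
```

this only defeats exact matching. an evil user could still cluster the noisy ratios and guess reasonably well, and i have no principled way to choose `noise_sd` or to know what the noise does to the variance estimates, which is why i'd rather lean on an established technique.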

thanks!!!!

Anthony Damico · Accepted Answer

i have written this nine-step tutorial to walk through the process in an attempt to answer my own question. i am not an expert in the field of privacy/confidentiality and would love to hear both feedback about this idea and also other ideas. thanks!

http://www.asdfree.com/2014/09/how-to-provide-variance-calculation-on.html

Tags:

r

obfuscation

privacy

survey
