How can I de- and re-classify data?

Tags:

r

Some of the data I work with contains sensitive information (names of people, dates, locations, etc.). But I sometimes need to share "the numbers" with other people to get help with statistical analysis, or process the data on more powerful machines where I can't control who looks at it.

Ideally I would like to work like this:

1. Read the data into R (look at it, clean it, etc.).
2. Select a data frame that I want to de-classify, run it through a package and receive two "files": the de-classified data and a translation file. The latter I will keep myself.
3. The de-classified data can be shared, manipulated and processed without worries.
4. I re-classify the processed data using the translation file.

I suppose that this can also be useful when uploading data for processing "in the cloud" (Amazon, etc.).

Have you been in this situation? I first thought about writing a "randomize" function myself, but then I realized there is no end to how sophisticated this can get (for example, offsetting time stamps without losing their order). Maybe there is already an established method or tool?

Thanks to everyone who contributes to [r]-tag here at Stack Overflow!

asked Feb 21 '11 14:02

Chris

2 Answers

One way to do this is with match. First I make a small dataframe:

foo <- data.frame( person=c("Mickey","Donald","Daisy","Scrooge"), score=rnorm(4))
foo
   person       score
1  Mickey -0.07891709
2  Donald  0.88678481
3   Daisy  0.11697127
4 Scrooge  0.31863009

Then I make a key:

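set.seed(100)  # makes the shuffle reproducible
# The key is the names in shuffled order; a name's position is its code
key <- as.character(foo$person[sample(nrow(foo))])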

Obviously, you must save this key somewhere safe. Now I can encode the persons:

foo$person <- match(foo$person, key)
foo
  person      score
1      2  0.3186301
2      1 -0.5817907
3      4  0.7145327
4      3 -0.8252594

If I want the person names again I can index the key:

key[foo$person]
[1] "Mickey"  "Donald"  "Daisy"   "Scrooge"

Or use transform. This also works if the data has changed, as long as the person IDs remain the same:

foo <- rbind(foo, foo[sample(1:4), ], foo[sample(1:4, 2), ], foo)
foo
   person      score
1       2  0.3186301
2       1 -0.5817907
3       4  0.7145327
4       3 -0.8252594
21      1 -0.5817907
41      3 -0.8252594
31      4  0.7145327
15      2  0.3186301
32      4  0.7145327
16      2  0.3186301
11      2  0.3186301
12      1 -0.5817907
13      4  0.7145327
14      3 -0.8252594
transform(foo, person=key[person])
    person      score
1   Mickey  0.3186301
2   Donald -0.5817907
3    Daisy  0.7145327
4  Scrooge -0.8252594
21  Donald -0.5817907
41 Scrooge -0.8252594
31   Daisy  0.7145327
15  Mickey  0.3186301
32   Daisy  0.7145327
16  Mickey  0.3186301
11  Mickey  0.3186301
12  Donald -0.5817907
13   Daisy  0.7145327
14 Scrooge -0.8252594
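
The steps above can be wrapped into a pair of helper functions. This is only a minimal sketch under some assumptions: the names `declassify` and `reclassify` are hypothetical (they do not come from any package), the key is stored as a named list with one character vector per sensitive column, and the scores are made-up values.

```r
# Hypothetical helpers (not from any package): replace the values in the
# given columns by integer codes and return the coded data plus the key.
declassify <- function(df, cols) {
  key <- list()
  for (col in cols) {
    values <- unique(as.character(df[[col]]))
    # Shuffle so the codes do not reveal the original order of appearance.
    key[[col]] <- values[sample.int(length(values))]
    df[[col]] <- match(as.character(df[[col]]), key[[col]])
  }
  list(data = df, key = key)
}

# Reverse the encoding by indexing each key with the integer codes.
reclassify <- function(df, key) {
  for (col in names(key)) {
    df[[col]] <- key[[col]][df[[col]]]
  }
  df
}

foo <- data.frame(person = c("Mickey", "Donald", "Daisy", "Scrooge"),
                  score  = c(0.1, 0.2, 0.3, 0.4),  # made-up scores
                  stringsAsFactors = FALSE)

out <- declassify(foo, "person")

# Keep the key private, e.g. in a local file; share only out$data.
key_file <- tempfile(fileext = ".rds")
saveRDS(out$key, key_file)

restored <- reclassify(out$data, readRDS(key_file))
stopifnot(identical(restored$person, foo$person))
```

Because decoding only indexes the key, it keeps working after rows are added, removed or reordered, as long as the codes themselves are left untouched.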

answered Sep 27 '22 17:09

Sacha Epskamp

Can you simply assign a GUID to each row from which you have removed all of the sensitive information? As long as your colleagues without security clearance don't alter the GUID, you can incorporate any changes and additions they make by joining on the GUID. Then it is just a matter of generating ersatz values for the columns whose data you have purged: LastName1, LastName2, City1, City2, etc.

EDIT: You'd have a table for each purged column (e.g. City, State, Zip, FirstName, LastName), each containing the distinct set of real classified values in that column and an integer value. So "Jones" could be represented in the sanitized dataset as, say, LastName22, "Schenectady" as City343, and "90210" as Zipcode716. This gives your colleagues valid values to work with (e.g. they'd have the same number of distinct cities as your real data, just with anonymized names), and the interrelationships of the anonymized data are preserved.

EDIT2: If the goal is to give your colleagues sanitized data that is still statistically meaningful, then date columns would require special processing. E.g. if your colleagues need to do statistical computations on the age of a person, you have to give them something close to the original date: not so close that it could be revealing, yet not so far that it could skew the analysis.
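
The per-column lookup tables described above might look like this in R. It is a rough sketch, not a vetted anonymization method: `make_lookup`, the column names and the example rows are all made up for illustration, and shifting every date by one secret offset is just one possible way to handle dates. It preserves order and intervals, but a single known date would reveal the offset.

```r
# Hypothetical helper: build a lookup table for one column, pairing each
# distinct real value with a label such as "City1", "City2", ...
make_lookup <- function(x, prefix) {
  real <- unique(as.character(x))
  data.frame(real  = real,
             label = paste0(prefix, sample.int(length(real))),
             stringsAsFactors = FALSE)
}

# Made-up example data.
people <- data.frame(
  last_name = c("Jones", "Smith", "Jones"),
  city      = c("Schenectady", "Albany", "Albany"),
  born      = as.Date(c("1970-03-14", "1985-11-02", "1992-07-30")),
  stringsAsFactors = FALSE
)

# One private lookup table per purged column.
lookups <- list(last_name = make_lookup(people$last_name, "LastName"),
                city      = make_lookup(people$city, "City"))

# De-classify: swap every real value for its label.
sanitized <- people
for (col in names(lookups)) {
  lk <- lookups[[col]]
  sanitized[[col]] <- lk$label[match(people[[col]], lk$real)]
}

# Dates: shift all dates by the same secret, non-zero number of days.
offset <- sample(c(-30:-1, 1:30), 1)
sanitized$born <- people$born + offset

# Re-classify: map labels back to real values and undo the date shift.
restored <- sanitized
for (col in names(lookups)) {
  lk <- lookups[[col]]
  restored[[col]] <- lk$real[match(sanitized[[col]], lk$label)]
}
restored$born <- sanitized$born - offset
```

Only `sanitized` would be shared; `lookups` and `offset` stay with the data owner. If ages matter for the analysis, a smaller offset or a different scheme may be needed, since a constant shift changes every age by the same amount.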

answered Sep 27 '22 17:09

Tim
