I have loaded an S3 bucket with JSON files and parsed/flattened them into a pandas dataframe. I now have a dataframe with 175 columns, 4 of which contain personally identifiable information.
I am looking for a quick solution to anonymise those columns (name & address). The anonymisation needs to be consistent, so that names or addresses of the same person occurring multiple times always map to the same hash.
Is there existing functionality in pandas or some other package I can utilise for this?
One option is to anonymise the PII fields in your data using hashing. What is hashing? Hashing is a one-way process that transforms a plaintext string into a fixed-length string. The hashing process has two important characteristics: it is very difficult to recover the original value from its hash, and the same input always produces the same hash, which gives you the consistency you need.
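A minimal sketch of that idea using hashlib with a salted SHA-256 digest (the column names and the salt value are placeholders; adapt them to your 4 PII columns):

import hashlib

import pandas as pd

SALT = "replace-with-a-secret-salt"  # keep the salt out of the shared dataset

def hash_value(value) -> str:
    # Salted SHA-256 hex digest; identical inputs always give identical digests
    return hashlib.sha256((SALT + str(value)).encode("utf-8")).hexdigest()

df = pd.DataFrame({
    'name':    ['Alice Smith', 'Bob Jones', 'Alice Smith'],
    'address': ['1 Main St',   '2 Oak Ave', '1 Main St'],
})

for col in ['name', 'address']:          # your 4 PII columns go here
    df[col + '_anon'] = df[col].map(hash_value)

print(df)  # repeated names/addresses get the same hash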
A common scenario for data scientists is sharing data with others. But what should you do if that data contains personally identifiable information (PII) such as email addresses, customer IDs or phone numbers? A simple solution is to remove those fields before sharing the data; however, your analysis may rely on having them.
Anonymizing the data offers another solution. Once data is anonymized, it is no longer personal data. The situation is different with pseudonymized data: with the appropriate additional knowledge, the person can still be re-identified. The easiest way to make your data compliant is simply to delete the columns that contain GDPR-relevant data.
Ideally, though, the anonymized dataset keeps the same amount of data and maintains its analytical value. One possible transformation simply maps the original information to fake, and therefore anonymous, information while maintaining the same overall structure.
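If you want fake-but-realistic replacements rather than opaque codes, one possible sketch uses the Faker package (an assumption; it is not part of pandas), caching one fake value per original value so repeats stay consistent:

import pandas as pd
from faker import Faker  # assumed to be installed: pip install Faker

fake = Faker()

df = pd.DataFrame({
    'name':    ['Alice Smith', 'Bob Jones', 'Alice Smith'],
    'address': ['1 Main St',   '2 Oak Ave', '1 Main St'],
})

# One fake value per original value, so the same person keeps the same alias
name_map = {v: fake.name() for v in df['name'].unique()}
addr_map = {v: fake.address() for v in df['address'].unique()}

df['name'] = df['name'].map(name_map)
df['address'] = df['address'].map(addr_map)

print(df)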
Using a Categorical would be an efficient way to do this. The main caveat is that the numbering is based solely on the ordering in the data, so some care is needed if the numbering scheme has to be reused across multiple columns / datasets (see the sketch after the example below).
import pandas as pd

df = pd.DataFrame({'ssn': [1, 2, 3, 999, 10, 1]})
df['ssn_anon'] = df['ssn'].astype('category').cat.codes
df
Out[38]:
   ssn  ssn_anon
0    1         0
1    2         1
2    3         2
3  999         4
4   10         3
5    1         0
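If that numbering needs to stay consistent across several dataframes (the caveat above), a hedged sketch is to build an explicit mapping once and reuse and extend it yourself instead of relying on category ordering:

import pandas as pd

df1 = pd.DataFrame({'ssn': [1, 2, 3, 999, 10, 1]})
df2 = pd.DataFrame({'ssn': [999, 1, 42]})

# value -> code mapping, built from the first dataset in order of appearance
mapping = {val: code for code, val in enumerate(pd.unique(df1['ssn']))}

def encode(series, mapping):
    # Reuse known codes, assign fresh codes to values not seen before
    codes = []
    for val in series:
        if val not in mapping:
            mapping[val] = len(mapping)
        codes.append(mapping[val])
    return pd.Series(codes, index=series.index)

df1['ssn_anon'] = encode(df1['ssn'], mapping)
df2['ssn_anon'] = encode(df2['ssn'], mapping)  # 999 and 1 get the same codes as in df1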
You can use ngroup or factorize from pandas:
df.groupby('ssn').ngroup()
Out[25]:
0    0
1    1
2    2
3    4
4    3
5    0
dtype: int64
pd.factorize(df.ssn)[0]
Out[26]: array([0, 1, 2, 3, 4, 0], dtype=int64)
If you are doing ML, I would recommend the LabelEncoder approach from sklearn:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(df.ssn).transform(df.ssn)
Out[31]: array([0, 1, 2, 4, 3, 0], dtype=int64)
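To apply any of these to all four PII columns at once, a short sketch with factorize (the column names are made up for illustration):

import pandas as pd

df = pd.DataFrame({
    'first_name': ['Ann', 'Bob', 'Ann'],
    'last_name':  ['Lee', 'Kim', 'Lee'],
    'street':     ['1 Main St', '2 Oak Ave', '1 Main St'],
    'city':       ['Berlin', 'Hamburg', 'Berlin'],
})

pii_columns = ['first_name', 'last_name', 'street', 'city']  # your 4 PII columns

# Replace each PII column with integer codes; repeated values share a code
for col in pii_columns:
    df[col] = pd.factorize(df[col])[0]

print(df)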