
Anonymize specific columns with PII in a pandas DataFrame (Python)

I have loaded JSON files from an S3 bucket and parsed/flattened them into a pandas DataFrame. I now have a DataFrame with 175 columns, 4 of which contain personally identifiable information.

I am looking for a quick way to anonymise those columns (name & address). I need to preserve duplicates, so that names or addresses of the same person that occur multiple times get the same hash.

Is there existing functionality in pandas or some other package I can use for this?

JanBennk asked Dec 28 '17 13:12




2 Answers

Using a Categorical would be an efficient way to do this. The main caveat is that the codes depend solely on the values present in the data (they follow the sorted unique values of that column), so some care is needed if the same numbering scheme has to be used across multiple columns or datasets.

import pandas as pd

df = pd.DataFrame({'ssn': [1, 2, 3, 999, 10, 1]})

df['ssn_anon'] = df['ssn'].astype('category').cat.codes

df
Out[38]: 
   ssn  ssn_anon
0    1         0
1    2         1
2    3         2
3  999         4
4   10         3
5    1         0
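
If the same person can show up in more than one column, one way to deal with the caveat above is to build a single set of categories from all of the PII columns and reuse it for each of them. This is only a sketch: the column names `sender` and `receiver` and the sample values are made up for illustration, and missing values end up with the code -1.

```python
import pandas as pd

# Hypothetical data: the same identifiers appear in two columns.
df = pd.DataFrame({
    "sender": ["alice", "bob", "alice", "carol"],
    "receiver": ["bob", "carol", "dave", "alice"],
})
pii_cols = ["sender", "receiver"]

# Collect every distinct value from all PII columns into one category list.
all_values = pd.concat([df[c] for c in pii_cols], ignore_index=True)
shared_dtype = pd.CategoricalDtype(categories=all_values.dropna().unique())

# Encode each column against the shared categories so codes agree across columns.
anon = df.copy()
for col in pii_cols:
    anon[col] = df[col].astype(shared_dtype).cat.codes
```

With a shared dtype, "alice" gets the same code whether it occurs in `sender` or in `receiver`. The same dtype object could be reused on a second DataFrame, as long as it only contains values already seen.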
chrisb answered Sep 24 '22 13:09


You can use ngroup or factorize from pandas:

df.groupby('ssn').ngroup()
Out[25]: 
0    0
1    1
2    2
3    4
4    3
5    0
dtype: int64

pd.factorize(df.ssn)[0]
Out[26]: array([0, 1, 2, 3, 4, 0], dtype=int64)
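
For a wide frame where only a few columns hold PII, factorize can be applied column by column while leaving the rest untouched. The sketch below assumes hypothetical column names (`name`, `address`, `amount`) and keeps the second return value of `pd.factorize` as a lookup, which should be stored separately or discarded if the result must not be reversible.

```python
import pandas as pd

# Hypothetical frame: two PII columns plus one ordinary column.
df = pd.DataFrame({
    "name": ["Ann Lee", "Bo Chen", "Ann Lee"],
    "address": ["1 Main St", "2 Oak Ave", "1 Main St"],
    "amount": [10, 20, 30],
})
pii_cols = ["name", "address"]

lookups = {}
anon = df.copy()
for col in pii_cols:
    # codes are numbered in order of first appearance; uniques[code] is the original value
    codes, uniques = pd.factorize(df[col])
    anon[col] = codes
    lookups[col] = uniques
```

Repeated names or addresses receive the same integer, and `lookups["name"][0]` gives back the original value for code 0.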

If you are doing ML with sklearn, I would recommend this approach:

from sklearn import preprocessing

le = preprocessing.LabelEncoder()
le.fit(df.ssn).transform(df.ssn)

Out[31]: array([0, 1, 2, 4, 3, 0], dtype=int64)
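
Since the question asks for a hash, and the integer codes above change whenever the set of values changes, a keyed hash from the standard library is another option. This is a different technique from the encoders shown in the answers, sketched under assumptions: `SECRET_KEY`, the `pseudonymize` helper, the normalisation step (strip and lowercase) and the column names are all hypothetical and not part of pandas.

```python
import hashlib
import hmac

import pandas as pd

# Hypothetical secret; in practice it would be a private random value kept out of the data.
SECRET_KEY = b"replace-with-a-private-random-key"

def pseudonymize(value, key=SECRET_KEY):
    """Return a hex HMAC-SHA256 digest of the value; missing values pass through."""
    if pd.isna(value):
        return value
    # Assumed normalisation so trivial spelling differences map to the same digest.
    normalized = str(value).strip().lower()
    return hmac.new(key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()

df = pd.DataFrame({
    "name": ["Ann Lee", "ann lee ", "Bo Chen"],
    "amount": [1, 2, 3],
})
for col in ["name"]:  # list the PII columns here
    df[col] = df[col].map(pseudonymize)
```

The same input always yields the same digest for a given key, so the mapping stays consistent across columns and across separately loaded files. Anyone holding the key can recompute digests for candidate names, so this is pseudonymisation rather than full anonymisation.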
BENY answered Sep 22 '22 13:09