I have loaded an S3 bucket with JSON files and parsed/flattened them into a pandas dataframe. I now have a dataframe with 175 columns, 4 of which contain personally identifiable information.
I am looking for a quick solution to anonymise those columns (name & address). The anonymisation needs to be consistent, so that names or addresses of the same person occurring multiple times always map to the same hash.
Is there existing functionality in pandas or some other package I can utilise for this?
One option is to anonymise the PII fields in your data using hashing. What is hashing? Hashing is a one-way process that transforms a plaintext string into a fixed-length string. The hashing process has two important characteristics: it is very difficult to recover the original value from its hash, and the same input always produces the same hash, which gives you the consistency you need.
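A minimal sketch of that idea using hashlib with a salted SHA-256 digest (the column names and the salt value are placeholders; adapt them to your 4 PII columns):

import hashlib

import pandas as pd

SALT = "replace-with-a-secret-salt"  # keep the salt out of the shared dataset

def hash_value(value) -> str:
    # Salted SHA-256 hex digest; identical inputs always give identical digests
    return hashlib.sha256((SALT + str(value)).encode("utf-8")).hexdigest()

df = pd.DataFrame({
    'name':    ['Alice Smith', 'Bob Jones', 'Alice Smith'],
    'address': ['1 Main St',   '2 Oak Ave', '1 Main St'],
})

for col in ['name', 'address']:          # your 4 PII columns go here
    df[col + '_anon'] = df[col].map(hash_value)

print(df)  # repeated names/addresses get the same hash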
A common scenario for data scientists is sharing data with others. But what should you do if that data contains personally identifiable information (PII) such as email addresses, customer IDs or phone numbers? A simple solution is to remove those fields before sharing the data; however, your analysis may rely on having them.
Anonymizing the data offers another solution. Once data is anonymized, it is no longer personal data. The situation is different with pseudonymized data: with the appropriate additional knowledge, the person can still be re-identified. The easiest way to make your data compliant is simply to delete the columns that contain GDPR-relevant data.
Ideally, though, the anonymized dataset keeps the same amount of data and maintains its analytical value. One possible transformation simply maps the original information to fake, and therefore anonymous, information while maintaining the same overall structure.
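If you want fake-but-realistic replacements rather than opaque codes, one possible sketch uses the Faker package (an assumption; it is not part of pandas), caching one fake value per original value so repeats stay consistent:

import pandas as pd
from faker import Faker  # assumed to be installed: pip install Faker

fake = Faker()

df = pd.DataFrame({
    'name':    ['Alice Smith', 'Bob Jones', 'Alice Smith'],
    'address': ['1 Main St',   '2 Oak Ave', '1 Main St'],
})

# One fake value per original value, so the same person keeps the same alias
name_map = {v: fake.name() for v in df['name'].unique()}
addr_map = {v: fake.address() for v in df['address'].unique()}

df['name'] = df['name'].map(name_map)
df['address'] = df['address'].map(addr_map)

print(df)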
Using a Categorical would be an efficient way to do this. The main caveat is that the numbering is based solely on the ordering in the data, so some care is needed if the numbering scheme has to be reused across multiple columns / datasets (see the sketch after the example below).
import pandas as pd

df = pd.DataFrame({'ssn': [1, 2, 3, 999, 10, 1]})
df['ssn_anon'] = df['ssn'].astype('category').cat.codes
df
Out[38]:
   ssn  ssn_anon
0    1         0
1    2         1
2    3         2
3  999         4
4   10         3
5    1         0
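If that numbering needs to stay consistent across several dataframes (the caveat above), a hedged sketch is to build an explicit mapping once and reuse and extend it yourself instead of relying on category ordering:

import pandas as pd

df1 = pd.DataFrame({'ssn': [1, 2, 3, 999, 10, 1]})
df2 = pd.DataFrame({'ssn': [999, 1, 42]})

# value -> code mapping, built from the first dataset in order of appearance
mapping = {val: code for code, val in enumerate(pd.unique(df1['ssn']))}

def encode(series, mapping):
    # Reuse known codes, assign fresh codes to values not seen before
    codes = []
    for val in series:
        if val not in mapping:
            mapping[val] = len(mapping)
        codes.append(mapping[val])
    return pd.Series(codes, index=series.index)

df1['ssn_anon'] = encode(df1['ssn'], mapping)
df2['ssn_anon'] = encode(df2['ssn'], mapping)  # 999 and 1 get the same codes as in df1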
You can use ngroup or factorize from pandas:
df.groupby('ssn').ngroup()
Out[25]:
0    0
1    1
2    2
3    4
4    3
5    0
dtype: int64
pd.factorize(df.ssn)[0]
Out[26]: array([0, 1, 2, 3, 4, 0], dtype=int64)
If you are doing ML, I would recommend the LabelEncoder approach from sklearn:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(df.ssn).transform(df.ssn)
Out[31]: array([0, 1, 2, 4, 3, 0], dtype=int64)
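To apply any of these to all four PII columns at once, a short sketch with factorize (the column names are made up for illustration):

import pandas as pd

df = pd.DataFrame({
    'first_name': ['Ann', 'Bob', 'Ann'],
    'last_name':  ['Lee', 'Kim', 'Lee'],
    'street':     ['1 Main St', '2 Oak Ave', '1 Main St'],
    'city':       ['Berlin', 'Hamburg', 'Berlin'],
})

pii_columns = ['first_name', 'last_name', 'street', 'city']  # your 4 PII columns

# Replace each PII column with integer codes; repeated values share a code
for col in pii_columns:
    df[col] = pd.factorize(df[col])[0]

print(df)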