I am currently running an experiment on a dataset using differential privacy concepts. I am trying to implement one of the mechanisms of differential privacy, namely the Laplace mechanism, using a sample dataset from the UCI Machine Learning Repository and the Python programming language.
Let's assume that we have a simple counting query where we want to know the number of people who earn '<=50K', grouped by their 'occupation':
SELECT
adult.occupation, COUNT(adult.salary_group) As NumofPeople
FROM
adult
WHERE
adult.salary_group = '<=50K'
GROUP BY
adult.occupation, adult.salary_group;
And this is the Laplace function I am trying to use:
import numpy as np

def laplaceMechanism(x, epsilon):
    # Add one draw of Laplace noise (location 0, scale 1/epsilon) to the true value.
    x += np.random.laplace(0, 1.0/epsilon, 1)[0]
    return x
So, my question is: how could I apply the function to the data I got, if we take epsilon=2? I know that the Laplace mechanism works by adding random noise drawn from the Laplace distribution to the true answer we get from the query. A bit of insight would be appreciated...
Assuming you have already loaded the CSV from the link into a database to run the SQL query, you can apply your Laplace function by first loading the results of the query into a pandas DataFrame using pandas.read_sql():
import pandas as pd
query = '''SELECT
adult.occupation, COUNT(adult.salary_group) AS NumOfPeople
FROM
adult
WHERE
adult.salary_group = '<=50K'
GROUP BY
adult.occupation, adult.salary_group;'''
df = pd.read_sql(query, '<database-connection-string>')
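If you do not have a database set up yet, the loading step can be tried out with a throwaway in-memory SQLite database. This is only a sketch: the four rows below are made up and stand in for the real adult table, and the column alias is written as NumOfPeople so that it matches the name used when indexing the dataframe.

```python
import sqlite3

import pandas as pd

# Hypothetical miniature stand-in for the adult table; the rows are made up.
rows = [
    ("Sales", "<=50K"),
    ("Sales", "<=50K"),
    ("Sales", ">50K"),
    ("Tech-support", "<=50K"),
]

# Build a temporary in-memory database holding the toy table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE adult (occupation TEXT, salary_group TEXT)")
conn.executemany("INSERT INTO adult VALUES (?, ?)", rows)

# Same counting query as above, with ORDER BY added for a stable row order.
query = '''SELECT
    adult.occupation, COUNT(adult.salary_group) AS NumOfPeople
FROM
    adult
WHERE
    adult.salary_group = '<=50K'
GROUP BY
    adult.occupation, adult.salary_group
ORDER BY
    adult.occupation;'''

# pandas accepts a sqlite3 connection directly.
df = pd.read_sql(query, conn)
conn.close()
print(df)
```

With another database you would pass a SQLAlchemy engine or connection string instead of the sqlite3 connection.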
Then you can apply your function using pandas.Series.apply(), using args to pass in your epsilon:
df['NumOfPeople'] = df['NumOfPeople'].apply(laplaceMechanism, args=(2,))
The above would replace the NumOfPeople column with the adjusted values. You could instead keep the new series separate, attach it to the dataframe as a new column with a different name, or copy the dataframe first to keep the original around too.
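As a rough sketch of the "new column" option, the noise can also be drawn for all rows at once and added to the counts. The dataframe contents and the NoisyNumOfPeople column name are made up for illustration. The sketch assumes each person falls into exactly one occupation group, so one person changes any count by at most 1 and the noise scale is 1/epsilon, the same scale your function uses.

```python
import numpy as np
import pandas as pd

# Hypothetical query result; the counts are made up for illustration.
df = pd.DataFrame({
    "occupation": ["Sales", "Tech-support", "Craft-repair"],
    "NumOfPeople": [2667, 645, 3170],
})

epsilon = 2.0
sensitivity = 1.0  # assumption: one person changes a count by at most 1

# Seeded only so the sketch is reproducible; a real release would not fix the seed.
rng = np.random.default_rng(seed=0)

# One independent Laplace draw per row, with scale = sensitivity / epsilon.
noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon, size=len(df))

# Keep the true counts and store the noisy ones in a separate column.
df["NoisyNumOfPeople"] = df["NumOfPeople"] + noise
print(df)
```

The noisy values are floats; whether to round them before publishing is up to you, since rounding after the noise has been added does not use the original data again.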