Implementation technique for differential privacy

I am currently running an experiment on a dataset using differential privacy concepts. I am trying to implement one of the mechanisms of differential privacy, namely the Laplace mechanism, using a sample dataset from the UCI Machine Learning Repository and the Python programming language.
Let's assume that we have a simple counting query where we want to know the number of people who earn '<=50K', grouped by their 'occupation':

SELECT 
   adult.occupation, COUNT(adult.salary_group) As NumofPeople 
FROM 
   adult
WHERE 
   adult.salary_group = '<=50K'
GROUP BY 
   adult.occupation, adult.salary_group;

And this is the Laplace function I am trying to use:

import numpy as np

def laplaceMechanism(x, epsilon):
    # Add Laplace noise with scale 1/epsilon (a count query has sensitivity 1).
    x += np.random.laplace(0, 1.0/epsilon, 1)[0]
    return x

So, my question is: how can I apply this function to the data I get back from the query, taking epsilon = 2? I know that the Laplace mechanism works by adding random noise drawn from the Laplace distribution to the true answer of the query. A bit of insight would be appreciated.

fudu asked Nov 19 '16 15:11



1 Answer

Assuming you have already loaded the CSV from the link into a database so that you can run the SQL query, you can apply your Laplace function by first loading the query results into a pandas DataFrame using pandas.read_sql():

import pandas as pd

query =  '''SELECT 
   adult.occupation, COUNT(adult.salary_group) As NumofPeople 
FROM 
   adult
WHERE 
   adult.salary_group = '<=50K'
GROUP BY 
   adult.occupation, adult.salary_group;'''

df = pd.read_sql(query, '<database-connection-string>')
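
For a quick local test, one option is an in-memory SQLite database, since pandas.read_sql() also accepts a sqlite3 connection object. The sketch below is only illustrative: the rows are made-up toy data (not values from the UCI dataset), and load_counts is a hypothetical helper name.

```python
import sqlite3

import pandas as pd

QUERY = """
SELECT adult.occupation, COUNT(adult.salary_group) AS NumofPeople
FROM adult
WHERE adult.salary_group = '<=50K'
GROUP BY adult.occupation, adult.salary_group;
"""


def load_counts(conn):
    """Run the counting query and return the result as a DataFrame."""
    return pd.read_sql(QUERY, conn)


# Toy table with made-up rows, only to make the sketch runnable.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE adult (occupation TEXT, salary_group TEXT)")
conn.executemany(
    "INSERT INTO adult VALUES (?, ?)",
    [
        ("Sales", "<=50K"),
        ("Sales", "<=50K"),
        ("Sales", ">50K"),
        ("Tech-support", "<=50K"),
    ],
)
conn.commit()

df = load_counts(conn)
print(df)
```

With a real database you would pass your own connection instead of the in-memory one.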

Then you can apply your function with pandas.Series.apply(), using args to pass in your epsilon:

df['NumofPeople'] = df['NumofPeople'].apply(laplaceMechanism, args=(2,))

The above replaces the NumofPeople column with the noisy values. The column name must match the alias as your database returns it; some databases fold unquoted aliases to lower case. Alternatively, you could keep the new series separate, attach it to the dataframe as a new column with a different name, or copy the dataframe first to keep the original around too.
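
As a minimal sketch of the "new column" option, assuming a counting query with sensitivity 1 (adding or removing one person changes one group's count by 1), the noise scale is 1/epsilon, so epsilon = 2 gives a scale of 0.5. The counts in the DataFrame are made up, and the add_laplace_noise helper and the NoisyNumofPeople column name are hypothetical.

```python
import numpy as np
import pandas as pd


def add_laplace_noise(counts, epsilon, sensitivity=1.0, rng=None):
    """Return counts plus independent Laplace(0, sensitivity/epsilon) noise."""
    rng = np.random.default_rng() if rng is None else rng
    scale = sensitivity / epsilon
    return counts + rng.laplace(loc=0.0, scale=scale, size=len(counts))


# Made-up counts standing in for the query result.
df = pd.DataFrame(
    {"occupation": ["Sales", "Tech-support"], "NumofPeople": [120, 45]}
)

rng = np.random.default_rng(0)  # fixed seed only to make the sketch repeatable
df["NoisyNumofPeople"] = add_laplace_noise(df["NumofPeople"], epsilon=2, rng=rng)
print(df)
```

The original NumofPeople column is left untouched here. Rounding the noisy counts or clamping them at zero afterwards is post-processing and does not weaken the privacy guarantee.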

David Dean answered Oct 13 '22 20:10
