I have a moderately large (~60,000 rows by 15 columns) CSV file that I'm working on with pandas. Each row represents an individual and contains personal data. I want to anonymize the data. One way I want to do so is by replacing rare values in a particular column. I initially tried to do so as follows:
def clean_data(entry):
    if df[df.column_name == entry].index.size < 10:
        return 'RARE_VALUE'
    else:
        return entry

df.new_column_name = df.column_name.apply(clean_data)
But running it froze my system every time. This unfortunately means I have no useful debugging data. Does anyone know the correct way to do this? The column contains both strings and null values.
I think you want to groupby column_name:
g = df.groupby('column_name')
You can use filter, for example, to return only those rows whose value in column_name appears at least 10 times:
g.filter(lambda x: len(x) >= 10)
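For example, on a toy frame (made-up data, just for illustration) the rows belonging to small groups are dropped entirely:

import pandas as pd

toy = pd.DataFrame({'column_name': ['a'] * 12 + ['b'] * 3})
g_toy = toy.groupby('column_name')

# Only the 12 'a' rows survive; the 3 'b' rows are removed.
print(g_toy.filter(lambda x: len(x) >= 10))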
To overwrite the column with 'RARE_VALUE' you can use transform (which calculates the result once for each group, and spreads it around appropriately); here col is your column name, e.g. 'column_name':
df.loc[g[col].transform(lambda x: len(x) < 10).astype(bool), col] = 'RARE_VALUE'
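To see what transform is doing, here is a minimal sketch on made-up data: the per-group result is broadcast back to every row of that group, so the mask lines up with the original index.

import pandas as pd

toy = pd.DataFrame({'column_name': ['a'] * 12 + ['b'] * 3})
g_toy = toy.groupby('column_name')

# One value per group, broadcast to every row of that group.
mask = g_toy['column_name'].transform(lambda x: len(x) < 10).astype(bool)
# mask is False for the 12 'a' rows and True for the 3 'b' rows.
toy.loc[mask, 'column_name'] = 'RARE_VALUE'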
As DSM points out, the following trick is much faster:
df.loc[df[col].value_counts()[df[col]].values < 10, col] = "RARE_VALUE"
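The reason this is fast: df[col].value_counts() is computed once, and indexing that counts Series with the column itself gives each row its own count, aligned row-for-row with the frame. A minimal sketch on made-up data (note that value_counts() drops nulls by default, so a column containing nulls may need dropna=False or separate handling):

import pandas as pd

toy = pd.DataFrame({'column_name': ['a'] * 12 + ['b'] * 3})
col = 'column_name'

counts = toy[col].value_counts()      # a: 12, b: 3
per_row = counts[toy[col]].values     # one count per row: 12 for 'a' rows, 3 for 'b' rows

toy.loc[per_row < 10, col] = 'RARE_VALUE'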
Here's some timeit information (to show how impressive DSM's solution is!):
In [20]: df = pd.DataFrame(np.random.randint(1, 100, (1000, 2)))

In [21]: g = df.groupby(0)
In [22]: %timeit g.filter(lambda x: len(x) >= 10)
10 loops, best of 3: 67.2 ms per loop
In [23]: %timeit df.loc[g[1].transform(lambda x: len(x) < 10).values.astype(bool), 1]
10 loops, best of 3: 44.6 ms per loop
In [24]: %timeit df.loc[df[1].value_counts()[df[1]].values < 10, 1]
1000 loops, best of 3: 1.57 ms per loop
@Andy Hayden solves the issue in various ways. I would recommend using pipelines for this kind of task, though. It may seem more unwieldy, but it comes in handy if you want to save the whole pipeline as an object, or if you have to apply the same transformation to a test set:
class RemoveScarceValuesFeatureEngineer:

    def __init__(self, min_occurences):
        self._min_occurences = min_occurences
        self._column_value_counts = {}

    def fit(self, X, y):
        # Remember the value counts of every column in the training data.
        for column in X.columns:
            self._column_value_counts[column] = X[column].value_counts()
        return self

    def transform(self, X):
        # Replace values that occurred fewer than min_occurences times
        # in the fitted data with 'RARE_VALUE'.
        for column in X.columns:
            X.loc[self._column_value_counts[column][X[column]].values
                  < self._min_occurences, column] = "RARE_VALUE"
        return X

    def fit_transform(self, X, y):
        self.fit(X, y)
        return self.transform(X)
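And a small usage sketch, with made-up data and a hypothetical 'city' column, showing the point about a test set: values that were rare in the training data are also replaced when the fitted object is applied to new data.

import pandas as pd

train = pd.DataFrame({'city': ['Berlin'] * 12 + ['Ulm'] * 2})
test = pd.DataFrame({'city': ['Berlin', 'Ulm']})

fe = RemoveScarceValuesFeatureEngineer(min_occurences=10)
train = fe.fit_transform(train, y=None)

# 'Ulm' appeared only twice in the training data, so it is replaced
# in the test set as well.
print(fe.transform(test))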
You may find more information here: Pandas replace rare values