
Replace rarely occurring values in a pandas dataframe

Tags: python, pandas

I have a moderately large (~60,000 rows by 15 columns) csv file that I'm working on with pandas. Each row represents an individual and contains personal data. I want to render the data anonymous. One way I want to do so is by replacing values in a particular column where they are rare. I initially tried to do so as follows:

def clean_data(entry):
    # count the rows sharing this value; mask it if it occurs fewer than 10 times
    if df[df.column_name == entry].index.size < 10:
        return 'RARE_VALUE'
    else:
        return entry

df['new_column_name'] = df.column_name.apply(clean_data)

But running it froze my system every time. This unfortunately means I have no useful debugging data. Does anyone know the correct way to do this? The column contains both strings and null values.

asked Mar 05 '14 by Garry Cairns

2 Answers

I think you want to groupby column_name:

g = df.groupby('column_name')

You can use filter, for example, to return only those rows whose value in column_name appears at least 10 times:

g.filter(lambda x: len(x) >= 10)
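
To make the behaviour concrete, here is a tiny sketch on made-up data (the frame, values, and 10-row threshold are illustrative, not the asker's CSV):

import pandas as pd

df = pd.DataFrame({'column_name': ['a'] * 12 + ['b'] * 3})  # 'a' is common, 'b' is rare
g = df.groupby('column_name')

g.filter(lambda x: len(x) >= 10)  # returns only the 12 'a' rows; the 3 'b' rows are dropped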

To overwrite the rare values in the column with 'RARE_VALUE' you can use transform (which calculates the result once for each group and broadcasts it back to the matching rows), where col holds the column name:

df.loc[g[col].transform(lambda x: len(x) < 10).astype(bool), col] = 'RARE_VALUE'
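
On the toy frame above, the transform call by itself just produces a per-row boolean mask aligned with df (again only a sketch):

col = 'column_name'
g[col].transform(lambda x: len(x) < 10)  # False for the 12 'a' rows, True for the 3 'b' rows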

As DSM points out, the following trick is much faster:

df.loc[df[col].value_counts()[df[col]].values < 10, col] = "RARE_VALUE"
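
Run on the same toy frame, this one-liner masks the rare values in place:

df.loc[df[col].value_counts()[df[col]].values < 10, col] = 'RARE_VALUE'
df[col].unique()  # array(['a', 'RARE_VALUE'], dtype=object)

One caveat for the asker's data: value_counts() excludes NaN by default, so null values need separate handling; a map-based variant such as df.loc[df[col].map(df[col].value_counts()) < 10, col] = 'RARE_VALUE' should simply leave null rows untouched, because map returns NaN for keys it does not know and NaN < 10 evaluates to False.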

Here's some timeit information (to show how impressive DSM's solution is!):

In [20]: df = pd.DataFrame(np.random.randint(1, 100, (1000, 2)))

In [21]: g = df.groupby(0)

In [22]: %timeit g.filter(lambda x: len(x) >= 10)
10 loops, best of 3: 67.2 ms per loop

In [23]: %timeit df.loc[g[1].transform(lambda x: len(x) < 10).values.astype(bool), 1]
10 loops, best of 3: 44.6 ms per loop

In [24]: %timeit df.loc[df[1].value_counts()[df[1]].values < 10, 1]
1000 loops, best of 3: 1.57 ms per loop
answered by Andy Hayden

@Andy Hayden solves the issue in various ways. I would recommend using pipelines for this kind of task though. The following may seem more unwieldy, but it comes in handy if you want to save the whole pipeline as an object, or if you have to apply the same treatment to a test set:

class RemoveScarceValuesFeatureEngineer:

    def __init__(self, min_occurences):
        self._min_occurences = min_occurences
        self._column_value_counts = {}

    def fit(self, X, y):
        # learn, per column, how often each value occurs in the training data
        for column in X.columns:
            self._column_value_counts[column] = X[column].value_counts()
        return self

    def transform(self, X):
        # replace any value whose training-set count is below the threshold
        for column in X.columns:
            X.loc[self._column_value_counts[column][X[column]].values
                  < self._min_occurences, column] = "RARE_VALUE"

        return X

    def fit_transform(self, X, y):
        self.fit(X, y)
        return self.transform(X)
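
A minimal usage sketch, with made-up train/test frames (the 'city' column, its values, and the threshold are illustrative; unseen test categories are not handled by the class as written):

import pandas as pd

train = pd.DataFrame({'city': ['London'] * 12 + ['Paris'] * 3})
test = pd.DataFrame({'city': ['Paris', 'London', 'London']})

fe = RemoveScarceValuesFeatureEngineer(min_occurences=10)
train = fe.fit_transform(train, y=None)  # 'Paris' (3 occurrences) becomes 'RARE_VALUE'
test = fe.transform(test)                # reuses the counts learned from the training set

train['city'].unique()  # array(['London', 'RARE_VALUE'], dtype=object)
test['city'].unique()   # array(['RARE_VALUE', 'London'], dtype=object)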

You may find more information here: Pandas replace rare values

answered by RUser4512