
Replace rarely occurring values in a pandas dataframe

Tags: python, pandas

I have a moderately large (~60,000 rows by 15 columns) csv file that I'm working on with pandas. Each row represents an individual and contains personal data. I want to render the data anonymous. One way I want to do so is by replacing values in a particular column where they are rare. I initially tried to do so as follows:

def clean_data(entry):
    # count the rows sharing this value; mask it if it occurs fewer than 10 times
    if df[df.column_name == entry].index.size < 10:
        return 'RARE_VALUE'
    else:
        return entry

df['new_column_name'] = df.column_name.apply(clean_data)

But running it froze my system every time. This unfortunately means I have no useful debugging data. Does anyone know the correct way to do this? The column contains both strings and null values.

asked Mar 05 '14 by Garry Cairns

2 Answers

I think you want to groupby column_name:

g = df.groupby('column_name')

You can use filter, for example, to return only those rows whose value in column_name appears at least 10 times:

g.filter(lambda x: len(x) >= 10)
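
To make the behaviour concrete, here is a tiny sketch on made-up data (the frame, values, and 10-row threshold are illustrative, not the asker's CSV):

import pandas as pd

df = pd.DataFrame({'column_name': ['a'] * 12 + ['b'] * 3})  # 'a' is common, 'b' is rare
g = df.groupby('column_name')

g.filter(lambda x: len(x) >= 10)  # returns only the 12 'a' rows; the 3 'b' rows are dropped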

To overwrite the rare values in the column with 'RARE_VALUE' you can use transform (which calculates the result once for each group and broadcasts it back to the matching rows), where col holds the column name:

df.loc[g[col].transform(lambda x: len(x) < 10).astype(bool), col] = 'RARE_VALUE'
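
On the toy frame above, the transform call by itself just produces a per-row boolean mask aligned with df (again only a sketch):

col = 'column_name'
g[col].transform(lambda x: len(x) < 10)  # False for the 12 'a' rows, True for the 3 'b' rows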

As DSM points out, the following trick is much faster:

df.loc[df[col].value_counts()[df[col]].values < 10, col] = "RARE_VALUE"
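
Run on the same toy frame, this one-liner masks the rare values in place:

df.loc[df[col].value_counts()[df[col]].values < 10, col] = 'RARE_VALUE'
df[col].unique()  # array(['a', 'RARE_VALUE'], dtype=object)

One caveat for the asker's data: value_counts() excludes NaN by default, so null values need separate handling; a map-based variant such as df.loc[df[col].map(df[col].value_counts()) < 10, col] = 'RARE_VALUE' should simply leave null rows untouched, because map returns NaN for keys it does not know and NaN < 10 evaluates to False.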

Here's some timeit information (to show how impressive DSM's solution is!):

In [20]: df = pd.DataFrame(np.random.randint(1, 100, (1000, 2)))

In [21]: g = df.groupby(0)

In [22]: %timeit g.filter(lambda x: len(x) >= 10)
10 loops, best of 3: 67.2 ms per loop

In [23]: %timeit df.loc[g[1].transform(lambda x: len(x) < 10).values.astype(bool), 1]
10 loops, best of 3: 44.6 ms per loop

In [24]: %timeit df.loc[df[1].value_counts()[df[1]].values < 10, 1]
1000 loops, best of 3: 1.57 ms per loop
answered by Andy Hayden

@Andy Hayden solves the issue in various ways. I would recommend using pipelines for this kind of task though. The following may seem more unwieldy, but it comes in handy if you want to save the whole pipeline as an object, or if you have to apply the same treatment to a test set:

class RemoveScarceValuesFeatureEngineer:

    def __init__(self, min_occurences):
        self._min_occurences = min_occurences
        self._column_value_counts = {}

    def fit(self, X, y):
        # learn, per column, how often each value occurs in the training data
        for column in X.columns:
            self._column_value_counts[column] = X[column].value_counts()
        return self

    def transform(self, X):
        # replace any value whose training-set count is below the threshold
        for column in X.columns:
            X.loc[self._column_value_counts[column][X[column]].values
                  < self._min_occurences, column] = "RARE_VALUE"

        return X

    def fit_transform(self, X, y):
        self.fit(X, y)
        return self.transform(X)
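
A minimal usage sketch, with made-up train/test frames (the 'city' column, its values, and the threshold are illustrative; unseen test categories are not handled by the class as written):

import pandas as pd

train = pd.DataFrame({'city': ['London'] * 12 + ['Paris'] * 3})
test = pd.DataFrame({'city': ['Paris', 'London', 'London']})

fe = RemoveScarceValuesFeatureEngineer(min_occurences=10)
train = fe.fit_transform(train, y=None)  # 'Paris' (3 occurrences) becomes 'RARE_VALUE'
test = fe.transform(test)                # reuses the counts learned from the training set

train['city'].unique()  # array(['London', 'RARE_VALUE'], dtype=object)
test['city'].unique()   # array(['RARE_VALUE', 'London'], dtype=object)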

You may find more information here: Pandas replace rare values

answered by RUser4512