
Pandas: Replace a string with 'other' if it is not present in a list of strings

I have the following data frame, df, with column 'Class'

    Class
0   Individual
1   Group
2   A
3   B
4   C
5   D
6   Group

I would like to replace everything apart from Group and Individual with 'Other', so the final data frame is

    Class
0   Individual
1   Group
2   Other
3   Other
4   Other
5   Other
6   Group

The data frame is huge, with over 600K rows. What is the most efficient way to find all values other than 'Group' and 'Individual' and replace them with 'Other'?

I have seen examples for replace, such as:

df['Class'] = df['Class'].replace({'A':'Other', 'B':'Other'})

but I have far too many unique values to list them individually. I would rather specify only the values to keep, 'Group' and 'Individual', and replace everything else.

redwolf_cr7 asked Jul 13 '18 10:07




1 Answer

I think you need:

import numpy as np

df['Class'] = np.where(df['Class'].isin(['Individual','Group']), df['Class'], 'Other')
print(df)
        Class
0  Individual
1       Group
2       Other
3       Other
4       Other
5       Other
6       Group
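
For anyone who wants to run the approach above from scratch, here is a minimal self-contained sketch. The sample frame is rebuilt from the question, and the names `keep` and `mask` are only illustrative:

```python
import numpy as np
import pandas as pd

# Sample frame reconstructed from the question
df = pd.DataFrame({'Class': ['Individual', 'Group', 'A', 'B', 'C', 'D', 'Group']})

keep = ['Individual', 'Group']   # values to leave untouched (illustrative name)
mask = df['Class'].isin(keep)    # boolean Series: True where the value is kept

# np.where takes the original value where mask is True and 'Other' elsewhere
df['Class'] = np.where(mask, df['Class'], 'Other')
print(df)
```

Because `isin` takes the list of values to keep, the many other unique values never have to be enumerated.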

Another solution (slower):

m = (df['Class'] == 'Individual') | (df['Class'] == 'Group')
df['Class'] = np.where(m, df['Class'], 'Other')

Another solution:

df['Class'] = df['Class'].map({'Individual':'Individual', 'Group':'Group'}).fillna('Other')
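
A pandas-only variant of the same idea, not one of the solutions timed in this answer, is `Series.where`, which keeps values where the condition holds and substitutes the second argument elsewhere. A minimal sketch on the sample data from the question (the name `keep` is illustrative, and its speed relative to the solutions here is not measured):

```python
import pandas as pd

# Sample frame reconstructed from the question
df = pd.DataFrame({'Class': ['Individual', 'Group', 'A', 'B', 'C', 'D', 'Group']})

keep = ['Individual', 'Group']  # values to leave untouched (illustrative name)

# Series.where keeps entries where the condition is True, else uses 'Other'
df['Class'] = df['Class'].where(df['Class'].isin(keep), 'Other')
print(df)
```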

Performance (on real data it depends on the number of replacements):

#[700000 rows x 1 columns]
df = pd.concat([df] * 100000, ignore_index=True)
#print (df)

In [208]: %timeit df['Class1'] = np.where(df['Class'].isin(['Individual','Group']), df['Class'], 'Other')
25.9 ms ± 485 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [209]: %timeit df['Class2'] = np.where((df['Class'] == 'Individual') | (df['Class'] == 'Group'), df['Class'], 'Other')
120 ms ± 6.63 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [210]: %timeit df['Class3'] = df['Class'].map({'Individual':'Individual', 'Group':'Group'}).fillna('Other')
95.7 ms ± 3.85 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [211]: %timeit df.loc[~df['Class'].isin(['Individual', 'Group']), 'Class'] = 'Other'
97.8 ms ± 6.78 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
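
The last timing uses boolean indexing with `.loc`, which does not appear as a standalone solution above. A minimal sketch of it on the sample data from the question follows. Unlike the other timings, which write to new columns, it overwrites 'Class' in place:

```python
import pandas as pd

# Sample frame reconstructed from the question
df = pd.DataFrame({'Class': ['Individual', 'Group', 'A', 'B', 'C', 'D', 'Group']})

# ~ inverts the mask, selecting rows whose value is NOT in the keep list
not_kept = ~df['Class'].isin(['Individual', 'Group'])

# Assign 'Other' only to the selected rows; the other rows are left as they are
df.loc[not_kept, 'Class'] = 'Other'
print(df)
```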
jezrael answered Nov 08 '22 21:11
