I have the following data frame, df, with column 'Class'
Class
0 Individual
1 Group
2 A
3 B
4 C
5 D
6 Group
I would like to replace everything apart from Group and Individual with 'Other', so the final data frame is
Class
0 Individual
1 Group
2 Other
3 Other
4 Other
5 Other
6 Group
The dataframe is huge, with over 600 K rows. What is the best way to optimally look for values other than 'Group' and 'Individual' and replace them with 'Other'?
I have seen examples for replace, such as:
df['Class'] = df['Class'].replace({'A':'Other', 'B':'Other'})
but since the sheer amount of unique values i have are too many i cannot individually do this. I want to rather just use the exclude subset of 'Group' and 'Individual'.
Replace a specific string in a list. If you want to replace the string of elements of a list, use the string method replace() for each element with the list comprehension. If there is no string to be replaced, applying replace() will not change it, so you don't need to select an element with if condition .
Pandas replace multiple values in column replace. By using DataFrame. replace() method we will replace multiple values with multiple new strings or text for an individual DataFrame column. This method searches the entire Pandas DataFrame and replaces every specified value.
DataFrame. replace() function is used to replace values in column (one value with another value on all columns). This method takes to_replace, value, inplace, limit, regex and method as parameters and returns a new DataFrame. When inplace=True is used, it replaces on existing DataFrame object and returns None value.
Replace function for regex This pattern represents a generic sequence of characters. regex : For pandas to interpret the replacement as regular expression replacement, set it to True. value : This represents the value to be replaced in place of to_replace values.
I think you need:
df['Class'] = np.where(df['Class'].isin(['Individual','Group']), df['Class'], 'Other')
print (df)
Class
0 Individual
1 Group
2 Other
3 Other
4 Other
5 Other
6 Group
Another solution (slower):
m = (df['Class'] == 'Individual') | (df['Class'] == 'Group')
df['Class'] = np.where(m, df['Class'], 'Other')
Another solution:
df['Class'] = df['Class'].map({'Individual':'Individual', 'Group':'Group'}).fillna('Other')
Performance (in real data depends of number of replacements):
#[700000 rows x 1 columns]
df = pd.concat([df] * 100000, ignore_index=True)
#print (df)
In [208]: %timeit df['Class1'] = np.where(df['Class'].isin(['Individual','Group']), df['Class'], 'Other')
25.9 ms ± 485 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [209]: %timeit df['Class2'] = np.where((df['Class'] == 'Individual') | (df['Class'] == 'Group'), df['Class'], 'Other')
120 ms ± 6.63 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [210]: %timeit df['Class3'] = df['Class'].map({'Individual':'Individual', 'Group':'Group'}).fillna('Other')
95.7 ms ± 3.85 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [211]: %timeit df.loc[~df['Class'].isin(['Individual', 'Group']), 'Class'] = 'Other'
97.8 ms ± 6.78 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With