Pandas: Replace a string with 'other' if it is not present in a list of strings

Tags: python, python-3.x, pandas

I have the following data frame, df, with a column 'Class':

    Class
0   Individual
1   Group
2   A
3   B
4   C
5   D
6   Group

I would like to replace everything apart from Group and Individual with 'Other', so the final data frame is

    Class
0   Individual
1   Group
2   Other
3   Other
4   Other
5   Other
6   Group

The data frame is huge, with over 600K rows. What is the most efficient way to find the values other than 'Group' and 'Individual' and replace them with 'Other'?

I have seen examples for replace, such as:

df['Class'] = df['Class'].replace({'A':'Other', 'B':'Other'})

but I have far too many unique values to list them individually. I would rather specify only the values to keep, 'Group' and 'Individual', and replace everything else.

asked Jul 13 '18 10:07

redwolf_cr7

1 Answer

I think you need:

import numpy as np

# Keep 'Individual' and 'Group'; every other value becomes 'Other'
df['Class'] = np.where(df['Class'].isin(['Individual','Group']), df['Class'], 'Other')
print(df)
        Class
0  Individual
1       Group
2       Other
3       Other
4       Other
5       Other
6       Group
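
For anyone who wants to run this end to end, here is a minimal self-contained sketch. The sample frame is rebuilt from the question, and the names `keep`, `mask`, `via_numpy` and `via_where` are only illustrative. `Series.where` is shown as a pandas-only alternative; it is not one of the variants timed in this answer, so no claim is made about its speed.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the question
df = pd.DataFrame({'Class': ['Individual', 'Group', 'A', 'B', 'C', 'D', 'Group']})

# Values to keep (hypothetical variable name)
keep = ['Individual', 'Group']

# Boolean mask: True where the value is one of the kept labels
mask = df['Class'].isin(keep)

# NumPy variant: take the original value where mask is True, else 'Other'
via_numpy = np.where(mask, df['Class'], 'Other')

# pandas-only variant: Series.where keeps values where mask is True
via_where = df['Class'].where(mask, 'Other')

df['Class'] = via_numpy
print(df['Class'].tolist())
```

Both variants build the mask once with `isin`, so extending the list of kept labels only requires changing `keep`.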

Another solution (slower):

m = (df['Class'] == 'Individual') | (df['Class'] == 'Group')
df['Class'] = np.where(m, df['Class'], 'Other')

Another solution:

df['Class'] = df['Class'].map({'Individual':'Individual', 'Group':'Group'}).fillna('Other')
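
To see why the `map` approach works, the sketch below splits it into two steps. It assumes the sample data from the question; `keep`, `mapped` and `result` are illustrative names. Values missing from the dictionary come back from `map` as NaN, and `fillna` then turns them into 'Other'. One consequence is that any value that was already missing in the column would also end up as 'Other'.

```python
import pandas as pd

# Sample data rebuilt from the question
s = pd.Series(['Individual', 'Group', 'A', 'B', 'C', 'D', 'Group'], name='Class')

# Values to keep (hypothetical variable name)
keep = ['Individual', 'Group']

# Step 1: map each kept value to itself; anything else becomes NaN
mapped = s.map({value: value for value in keep})

# Step 2: replace the NaN entries with 'Other'
result = mapped.fillna('Other')
print(result.tolist())
```

Building the dictionary from `keep` avoids typing every label twice when the list of kept values grows.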

Performance (on real data it depends on the number of replacements):

#[700000 rows x 1 columns]
df = pd.concat([df] * 100000, ignore_index=True)
#print (df)

In [208]: %timeit df['Class1'] = np.where(df['Class'].isin(['Individual','Group']), df['Class'], 'Other')
25.9 ms ± 485 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [209]: %timeit df['Class2'] = np.where((df['Class'] == 'Individual') | (df['Class'] == 'Group'), df['Class'], 'Other')
120 ms ± 6.63 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [210]: %timeit df['Class3'] = df['Class'].map({'Individual':'Individual', 'Group':'Group'}).fillna('Other')
95.7 ms ± 3.85 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [211]: %timeit df.loc[~df['Class'].isin(['Individual', 'Group']), 'Class'] = 'Other'
97.8 ms ± 6.78 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
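
The last timing, In [211], uses a variant that is not listed among the solutions above: assigning through `.loc` with an inverted `isin` mask. A minimal sketch on the question's sample data follows (`keep` is an illustrative name). Unlike the other three variants in the timings, which write to new columns, this one overwrites 'Class' in place.

```python
import pandas as pd

# Sample data rebuilt from the question
df = pd.DataFrame({'Class': ['Individual', 'Group', 'A', 'B', 'C', 'D', 'Group']})

# Values to keep (hypothetical variable name)
keep = ['Individual', 'Group']

# ~ inverts the mask, so only rows NOT in keep are selected and overwritten
df.loc[~df['Class'].isin(keep), 'Class'] = 'Other'
print(df['Class'].tolist())
```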

answered Nov 08 '22 21:11

jezrael
