I have a list as follows,
remove_words = ['abc', 'deff', 'pls']
The following is the data frame which I am having with column name 'string'
data['string']
0 abc stack overflow
1 abc123
2 deff comedy
3 definitely
4 pls lkjh
5 pls1234
I want to check for words from remove_words list in the pandas dataframe column and remove those words in the pandas dataframe. I want to check for the words occurring individually without occurring with other words.
For example, if there is 'abc' in pandas df column, replace it with '' but if it occurs with abc123, we need to leave it as it is. The output here should be,
data['string']
0 stack overflow
1 abc123
2 comedy
3 definitely
4 lkjh
5 pls1234
In my actual data, I have 2000 words in the remove_words list and 5 billion records in the pandas dataframe. So I am looking for the best efficient way to do this.
I have tried few things in python, without much success. Can anybody help me in doing this? Any ideas would be helpful.
Thanks
Try this:
In [98]: pat = r'\b(?:{})\b'.format('|'.join(remove_words))
In [99]: pat
Out[99]: '\\b(?:abc|def|pls)\\b'
In [100]: df['new'] = df['string'].str.replace(pat, '')
In [101]: df
Out[101]:
string new
0 abc stack overflow stack overflow
1 abc123 abc123
2 def comedy comedy
3 definitely definitely
4 pls lkjh lkjh
5 pls1234 pls1234
Totally taking @MaxU's pattern!
We can use pd.DataFrame.replace
by setting the regex
parameter to True
and passing a dictionary of dictionaries that specifies the pattern and what to replace with for each column.
pat = '|'.join([r'\b{}\b'.format(w) for w in remove_words])
df.assign(new=df.replace(dict(string={pat: ''}), regex=True))
string new
0 abc stack overflow stack overflow
1 abc123 abc123
2 def comedy comedy
3 definitely definitely
4 pls lkjh lkjh
5 pls1234 pls1234
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With