Check for words from list and remove those words in pandas dataframe column

Question

I have a list as follows:

remove_words = ['abc', 'def', 'pls']

The following is my data frame, with a column named 'string':

     data['string']

0    abc stack overflow
1    abc123
2    def comedy
3    definitely
4    pls lkjh
5    pls1234

I want to remove every word in the remove_words list from the pandas dataframe column, but only where the word occurs on its own, not where it is part of a longer word.

For example, if 'abc' occurs as a standalone word in the column, replace it with '', but if it occurs inside abc123, leave it as it is. The output here should be:

     data['string']

0    stack overflow
1    abc123
2    comedy
3    definitely
4    lkjh
5    pls1234

In my actual data, I have 2000 words in the remove_words list and 5 billion records in the pandas dataframe, so I am looking for the most efficient way to do this.

I have tried a few things in Python without much success. Can anybody help me with this? Any ideas would be helpful.

Thanks

MaxU - stop WAR against UA · Accepted Answer

Try this:

In [98]: pat = r'\b(?:{})\b'.format('|'.join(remove_words))

In [99]: pat
Out[99]: '\\b(?:abc|def|pls)\\b'

In [100]: df['new'] = df['string'].str.replace(pat, '', regex=True)

In [101]: df
Out[101]:
               string              new
0  abc stack overflow   stack overflow
1              abc123           abc123
2          def comedy           comedy
3          definitely       definitely
4            pls lkjh             lkjh
5             pls1234          pls1234
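
For anyone who wants to run this outside an IPython session, here is a minimal self-contained sketch of the same idea. It makes two additions that are my assumptions rather than part of the answer above: each word is passed through re.escape in case the real list contains regex metacharacters, and the leftover whitespace is collapsed and stripped so the result matches the output asked for in the question. The sample frame is rebuilt inline from the rows shown above.

```python
import re
import pandas as pd

remove_words = ['abc', 'def', 'pls']
df = pd.DataFrame({'string': ['abc stack overflow', 'abc123', 'def comedy',
                              'definitely', 'pls lkjh', 'pls1234']})

# Escape each word so any regex metacharacters in it are matched literally,
# then wrap the alternation in word boundaries.
pat = r'\b(?:{})\b'.format('|'.join(re.escape(w) for w in remove_words))

df['new'] = (
    df['string']
    .str.replace(pat, '', regex=True)      # drop the standalone words
    .str.replace(r'\s+', ' ', regex=True)  # collapse runs of whitespace
    .str.strip()                           # trim the space left at either end
)

print(df)
```

Note that \b only marks a transition between word and non-word characters, so a list entry that starts or ends with punctuation may need a different boundary check.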

piRSquared · Answer

Totally taking @MaxU's pattern!

We can use pd.DataFrame.replace by setting the regex parameter to True and passing a dictionary of dictionaries that specifies the pattern and what to replace with for each column.

pat = '|'.join([r'\b{}\b'.format(w) for w in remove_words])

df.assign(new=df.replace(dict(string={pat: ''}), regex=True)['string'])

               string              new
0  abc stack overflow   stack overflow
1              abc123           abc123
2          def comedy           comedy
3          definitely       definitely
4            pls lkjh             lkjh
5             pls1234          pls1234
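
To make the dictionary-of-dictionaries form concrete, here is a small sketch under a few assumptions of mine: the frame has a second column named 'other' (hypothetical, only there to show it is left alone), the words are escaped with re.escape, and the result is stripped of the leftover space. The outer key names the column and the inner dictionary maps a regex pattern to its replacement.

```python
import re
import pandas as pd

remove_words = ['abc', 'def', 'pls']
df = pd.DataFrame({
    'string': ['abc stack overflow', 'abc123', 'def comedy'],
    'other': ['abc', 'def', 'pls'],  # hypothetical column, not targeted
})

pat = '|'.join(r'\b{}\b'.format(re.escape(w)) for w in remove_words)

# Outer key: column to search. Inner dict: regex pattern -> replacement.
replaced = df.replace({'string': {pat: ''}}, regex=True)

# Select the one column explicitly so this also works on multi-column frames.
result = df.assign(new=replaced['string'].str.strip())

print(result)
```

Because only 'string' appears as an outer key, the values in 'other' are not modified even though they match the pattern.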

Tags: python, regex, replace, pandas, python-2.7

haimen
