
Check for words from list and remove those words in pandas dataframe column

I have a list as follows,

remove_words = ['abc', 'deff', 'pls']

I have the following data frame, with a column named 'string':

     data['string']

0    abc stack overflow
1    abc123
2    deff comedy
3    definitely
4    pls lkjh
5    pls1234

I want to remove every word in the remove_words list from this column, but only where it occurs as a standalone word, not as part of a longer token.

For example, a standalone 'abc' should be replaced with '', but 'abc123' should be left as it is. The expected output is:

     data['string']

0    stack overflow
1    abc123
2    comedy
3    definitely
4    lkjh
5    pls1234

In my actual data, I have 2000 words in the remove_words list and 5 billion records in the pandas dataframe, so I am looking for the most efficient way to do this.

I have tried a few things in Python without much success. Can anybody help me with this? Any ideas would be helpful.

Thanks

asked Aug 01 '17 21:08 by haimen


2 Answers

Try this (on pandas 2.0 and later, pass regex=True explicitly, because str.replace no longer treats the pattern as a regular expression by default):

In [98]: pat = r'\b(?:{})\b'.format('|'.join(remove_words))

In [99]: pat
Out[99]: '\\b(?:abc|deff|pls)\\b'

In [100]: df['new'] = df['string'].str.replace(pat, '', regex=True)

In [101]: df
Out[101]:
               string              new
0  abc stack overflow   stack overflow
1              abc123           abc123
2         deff comedy           comedy
3          definitely       definitely
4            pls lkjh             lkjh
5             pls1234          pls1234
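
The approach above can be sketched as a self-contained script. This is a minimal sketch rather than the answerer's exact code: the re.escape call and the whitespace cleanup are additions, made on the assumption that the words may contain regex metacharacters and that leftover spaces should be dropped, as in the expected output in the question.

```python
import re
import pandas as pd

remove_words = ['abc', 'deff', 'pls']
df = pd.DataFrame({'string': ['abc stack overflow', 'abc123', 'deff comedy',
                              'definitely', 'pls lkjh', 'pls1234']})

# One alternation wrapped in word boundaries, so 'abc' matches only as a
# standalone word and never inside 'abc123'.
# re.escape is an added safeguard (assumption: words may contain regex
# metacharacters such as '.' or '+').
pat = r'\b(?:{})\b'.format('|'.join(map(re.escape, remove_words)))

df['new'] = (df['string']
             .str.replace(pat, '', regex=True)       # drop the listed words
             .str.replace(r'\s+', ' ', regex=True)   # collapse leftover runs of spaces
             .str.strip())                           # trim leading/trailing spaces

print(df)
```

Building a single pattern means each string is scanned once for all words, instead of once per word in remove_words.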
answered Sep 18 '22 13:09 by MaxU - stop WAR against UA


Totally taking @MaxU's pattern!

We can use pd.DataFrame.replace by setting the regex parameter to True and passing a dictionary of dictionaries that specifies the pattern and what to replace with for each column.

pat = '|'.join([r'\b{}\b'.format(w) for w in remove_words])

df.assign(new=df.replace(dict(string={pat: ''}), regex=True)['string'])

               string              new
0  abc stack overflow   stack overflow
1              abc123           abc123
2         deff comedy           comedy
3          definitely       definitely
4            pls lkjh             lkjh
5             pls1234          pls1234
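
As a rough, self-contained sketch of the dictionary-of-dictionaries form: the outer key names the column and the inner dictionary maps pattern to replacement, so other columns are left alone. The 'other' column, the re.escape call and the final strip are assumptions added for illustration and are not part of the original answer.

```python
import re
import pandas as pd

remove_words = ['abc', 'deff', 'pls']
df = pd.DataFrame({
    'string': ['abc stack overflow', 'abc123', 'deff comedy'],
    'other': ['abc', 'pls', 'deff'],  # hypothetical column that must stay untouched
})

# Each word gets its own pair of word boundaries, joined into one alternation.
pat = '|'.join(r'\b{}\b'.format(re.escape(w)) for w in remove_words)

# Nested dict: {column: {pattern: replacement}}; only 'string' is searched.
replaced = df.replace({'string': {pat: ''}}, regex=True)

# Select the column explicitly so the assignment also works when the frame
# has more than one column; strip removes the space left by a removed word.
result = df.assign(new=replaced['string'].str.strip())

print(result)
```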
answered Sep 19 '22 13:09 by piRSquared