
Remove Twitter mentions from Pandas column

I have a dataset of Tweets from Twitter. Some of them contain user mentions such as @thisisauser. I am trying to remove those mentions while I run my other cleaning steps.

def clean_text(row, options):

    if options['lowercase']:
        row = row.lower()

    if options['decode_html']:
        txt = BeautifulSoup(row, 'lxml')
        row = txt.get_text()

    if options['remove_url']:
        row = row.replace('http\S+|www.\S+', '')

    if options['remove_mentions']:
        row = row.replace('@[A-Za-z0-9]+', '')

    return row

clean_config = {
    'remove_url': True,
    'remove_mentions': True,
    'decode_utf8': True,
    'lowercase': True
    }

df['tweet'] = df['tweet'].apply(clean_text, args=(clean_config,))

However, when I run the above code, all the Twitter mentions are still in the text. I verified with an online regex tool that my regex works correctly, so the problem should be in the pandas code.

Tasos asked Feb 17 '19 13:02


2 Answers

You are misusing the replace method of a string: str.replace does not accept regular expressions, only fixed strings (see the docs at https://docs.python.org/2/library/stdtypes.html#str.replace for more).

The right way to do this is to use re.sub from the re module:

>>> import re
>>> re.sub("@[A-Za-z0-9]+", "", "@thisisauser text")
' text'
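
For the function in the question, the same fix could look roughly like this. It is only a sketch: it leaves out the BeautifulSoup step to stay self-contained, reads flags with options.get() so that a key missing from the config (such as 'decode_html') does not raise a KeyError, and escapes the dot in www\. on the assumption that a literal dot was intended.

```python
import re

# Same patterns as in the question, compiled once.
# Escaping the dot in "www." is an assumption about the intent.
URL_RE = re.compile(r'http\S+|www\.\S+')
MENTION_RE = re.compile(r'@[A-Za-z0-9]+')


def clean_text(row, options):
    # options.get() returns None for flags that are not in the config
    if options.get('lowercase'):
        row = row.lower()

    if options.get('remove_url'):
        # re.sub treats the pattern as a regex; str.replace does not
        row = URL_RE.sub('', row)

    if options.get('remove_mentions'):
        row = MENTION_RE.sub('', row)

    return row


clean_config = {'remove_url': True, 'remove_mentions': True, 'lowercase': True}
cleaned = clean_text('Hi @thisisauser see http://example.com now', clean_config)
print(repr(cleaned))
```

With a function like this, the df['tweet'].apply(clean_text, args=(clean_config,)) call from the question should work unchanged, since apply passes each tweet string to the function.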

Sergey Bushmanov answered Oct 20 '22 05:10


The problem is with the way you used the replace method, not with pandas.

See the output from the REPL:

>>> my_str ="@thisisause"
>>> my_str.replace('@[A-Za-z0-9]+', '')
'@thisisause'

str.replace doesn't support regex. Instead, use Python's regular expression library, re, as mentioned in the other answer:

>>> import re
>>> my_str = 'hello @username hi'
>>> re.sub("@[A-Za-z0-9]+", "", my_str)
'hello  hi'
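
If a per-row function is not needed, pandas can also apply the regex to the whole column through the .str accessor. A minimal sketch, assuming a pandas version where Series.str.replace accepts the regex argument; the sample DataFrame is made up for illustration:

```python
import pandas as pd

# Made-up sample data standing in for the real tweets
df = pd.DataFrame({'tweet': ['hello @username hi', 'no mention here']})

# Unlike Python's str.replace, Series.str.replace can treat the
# pattern as a regular expression when regex=True is passed.
df['tweet'] = df['tweet'].str.replace(r'@[A-Za-z0-9]+', '', regex=True)

print(df['tweet'].tolist())
```

Passing regex=True explicitly avoids depending on the default, which has differed between pandas versions.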

stormfield answered Oct 20 '22 05:10