
Clean tweet text of URLs and punctuation

Tags:

python

As part of an Information Retrieval project in Python (building a mini search engine), I want to clean the text of downloaded tweets (a .csv data set of 27000 tweets). A tweet looks like this:

"The basic longing to live with dignity...these yearnings are universal. They burn in every human heart 1234." —@POTUS https://twitter.com/OZRd5o4wRL

or

"Democracy...allows us to peacefully work through our differences, and move closer to our ideals" —@POTUS in Greece https://twitter.com/PIO9dG2qjX

Using regex, I want to remove the unnecessary parts of the tweets, such as URLs and punctuation.

So the result will be:

"The basic longing to live with dignity these yearnings are universal They burn in every human heart POTUS"

and

"Democracy allows us to peacefully work through our differences and move closer to our ideals POTUS in Greece"

I tried pattern = RegexpTokenizer(r'[A-Za-z]+|^[0-9]'), but it does not do the job fully: parts of the URL, for example, are still present in the result.
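To show what goes wrong, here is a minimal sketch that uses re.findall as a stand-in for the tokenizer. It assumes that RegexpTokenizer with default settings returns the same matches as findall for this pattern on a single-line string; the names PATTERN, tweet and tokens are illustrative only.

```python
import re

# Same pattern as in the attempt above; re.findall stands in for the tokenizer (assumption).
PATTERN = r"[A-Za-z]+|^[0-9]"

tweet = 'They burn in every human heart 1234." —@POTUS https://twitter.com/OZRd5o4wRL'

# Every run of ASCII letters becomes a token, wherever it occurs,
# so the letters inside the URL are returned as separate tokens too.
tokens = re.findall(PATTERN, tweet)
print(tokens)
```

The pattern has no notion of a URL, so "https", "twitter", "com" and the letter runs of the short link all come back as tokens.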

Please help me find a regex pattern that will do what I want.

Geralyn Feltner asked Nov 26 '25 11:11



1 Answer

This might help.

Demo:

import re

s1 = """"Democracy...allows us to peacefully work through our differences, and move closer to our ideals" —@POTUS in Greece https://twitter.com/PIO9dG2qjX"""
s2 = """"The basic longing to live with dignity...these yearnings are universal. They burn in every human heart 1234." —@POTUS https://twitter.com/OZRd5o4wRL"""    

def cleanString(text):
    res = []
    for token in text.split():
        # Skip URL tokens. Note: only works if the token contains http:// or https://.
        if re.search(r"https?://", token):
            continue
        # Replace every run of non-letters with a space, then split again, so that
        # "dignity...these" yields two words and "1234." yields none.
        res.extend(re.sub(r"[^A-Za-z]+", " ", token).split())
    return " ".join(res)

print(cleanString(s1))
print(cleanString(s2))
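An alternative to the token-by-token loop is to work on the whole string with two substitutions: first drop anything that looks like a URL, then collapse every run of non-letters into a single space. This is only a sketch under the assumption that URLs start with http:// or https:// and contain no whitespace; the names URL_RE, NON_LETTER_RE and clean_tweet are hypothetical.

```python
import re

# Assumption: a URL starts with http:// or https:// and runs until the next whitespace.
URL_RE = re.compile(r"https?://\S+")
# Any run of characters that are not ASCII letters (punctuation, digits, @, whitespace).
NON_LETTER_RE = re.compile(r"[^A-Za-z]+")

def clean_tweet(text):
    """Remove URLs first, then replace every run of non-letters with one space."""
    text = URL_RE.sub(" ", text)
    text = NON_LETTER_RE.sub(" ", text)
    return text.strip()

tweet = ('"Democracy...allows us to peacefully work through our differences, '
         'and move closer to our ideals" —@POTUS in Greece https://twitter.com/PIO9dG2qjX')
print(clean_tweet(tweet))
```

The order matters: if the non-letter substitution ran first, the URL would already be broken into letter fragments and could no longer be recognised. Note that the character class keeps ASCII letters only, so accented letters would be dropped as well.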
Rakesh answered Nov 29 '25 01:11



