I have a bunch of tweets in plaintext form, shown below. I am looking to extract only the text part.
SAMPLE DATA IN FILE -
Fri Nov 13 20:27:16 +0000 2015 4181010297 rt we're treating one of you lads to this d'struct denim shirt! simply follow & rt to enter
Fri Nov 13 20:27:16 +0000 2015 2891325562 this album is wonderful, i'm so proud of you, i loved this album, it really is the best. -273
Fri Nov 13 20:27:19 +0000 2015 2347993701 international break is garbage smh. it's boring and your players get injured
Fri Nov 13 20:27:20 +0000 2015 3168571911 get weather updates from the weather channel. 15:27:19
Fri Nov 13 20:27:20 +0000 2015 2495101558 woah what happened to twitter this update is horrible
Fri Nov 13 20:27:19 +0000 2015 229544082 i've completed the daily quest in paradise island 2!
Fri Nov 13 20:27:17 +0000 2015 309233999 new post: henderson memorial public library
Fri Nov 13 20:27:21 +0000 2015 291806707 who's going to next week?
Fri Nov 13 20:27:19 +0000 2015 3031745900 why so blue? @ golden bee
This is my attempt at the preprocess stage -
import glob

for filename in glob.glob('*.txt'):
    with open("plain text - preprocesshurricane.txt",'a') as outfile ,open(filename, 'r') as infile:
        for tweet in infile.readlines():
            temp=tweet.split(' ')
            text=""
            for i in temp:
                x=str(i)
                if x.isalpha() :
                    text += x + ' '
            print(text)
OUTPUT-
Fri Nov rt treating one of you lads to this denim simply follow rt to
Fri Nov this album is so proud of i loved this it really is the
Fri Nov international break is garbage boring and your players get
Fri Nov get weather updates from the weather
Fri Nov woah what happened to twitter this update is
Fri Nov completed the daily quest in paradise island
Fri Nov new henderson memorial public
Fri Nov going to next
Fri Nov why so golden
This output is not the desired output because
1. It will not let me fetch numbers/digits within the text part of the tweet.
2. Every line starts with FRI NOV.
Could you please suggest a better method to achieve the same? I am not too familiar with regex, but I assume we could employ re.search(r'2015(magic to remove tweetID)\w*',tweet)
To read a text file in Python, follow three steps: first, open the file for reading with the open() function; second, read its contents with the file object's read(), readline(), or readlines() method; third, close the file with its close() method.
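As a rough sketch of those three steps, the snippet below first writes a small sample file (the name sample_tweets.txt and its contents are assumptions made only for illustration) and then reads it back. In practice a with block closes the file for you; the explicit close() here just mirrors the steps as listed.

```python
import os
import tempfile

# Hypothetical sample file, created only so the example is self-contained.
path = os.path.join(tempfile.mkdtemp(), "sample_tweets.txt")
with open(path, "w") as f:
    f.write("first line\nsecond line\n")

# Step 1: open the file for reading.
infile = open(path, "r")
try:
    # Step 2: read its contents; readlines() returns one string per line.
    lines = infile.readlines()
finally:
    # Step 3: close the file, even if reading raised an error.
    infile.close()

print(lines)
```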
You can seek into the file and then read a fixed amount from there. seek() moves to a specific byte offset within the file, and passing a size to read() limits the read to that many bytes, so only the data you are looking for is read.
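A minimal sketch of seeking and then reading a limited range, assuming a small throwaway file (sample.txt is a made-up name) opened in binary mode so that offsets are plain byte counts:

```python
import os
import tempfile

# Hypothetical sample file, created only so the example is self-contained.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "wb") as f:
    f.write(b"0123456789abcdef")

with open(path, "rb") as f:
    f.seek(10)         # jump to byte offset 10
    chunk = f.read(3)  # read only the next 3 bytes

print(chunk)
```

This suits fixed-width records better than the tweet file in the question, where each line has a different length.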
You can also use an index as a line number to extract specific lines. This is the most straightforward way to read a particular line from a file in Python: read the entire file into a list of lines and then pick the ones you need by index.
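For illustration, picking lines by index might look like the sketch below. The file name and contents are assumptions, and note that list indices start at 0, so index 1 is the second line.

```python
import os
import tempfile

# Hypothetical sample file, created only so the example is self-contained.
path = os.path.join(tempfile.mkdtemp(), "lines.txt")
with open(path, "w") as f:
    f.write("line zero\nline one\nline two\nline three\n")

# Read the whole file and split it into a list of lines without newlines.
with open(path) as f:
    lines = f.read().splitlines()

second = lines[1]    # a single line by index
middle = lines[1:3]  # a slice of consecutive lines

print(second)
print(middle)
```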
You can avoid regular expressions in this case. Every line you've shown has the same number of space-separated fields (seven) before the tweet text, so just split() on a space at most seven times and keep the last piece:
>>> data = """
lines with tweets here
"""
>>> for line in data.splitlines():
...     print(line.split(" ", 7)[-1])
...
rt we're treating one of you lads to this d'struct denim shirt! simply follow & rt to enter
this album is wonderful, i'm so proud of you, i loved this album, it really is the best. -273
international break is garbage smh. it's boring and your players get injured
get weather updates from the weather channel. 15:27:19
woah what happened to twitter this update is horrible
i've completed the daily quest in paradise island 2!
new post: henderson memorial public library
who's going to next week?
why so blue? @ golden bee
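The same idea can be wrapped in a small helper. This is only a sketch: extract_text is a hypothetical name, and it assumes every well-formed line has exactly seven space-separated fields (timestamp pieces and tweet ID) before the text. Lines that are too short yield an empty string instead of a stray timestamp fragment.

```python
def extract_text(line):
    """Return the tweet text, assuming seven space-separated fields precede it."""
    parts = line.rstrip("\n").split(" ", 7)
    # A well-formed line splits into 8 parts; anything shorter has no text field.
    return parts[7] if len(parts) == 8 else ""


sample = [
    "Fri Nov 13 20:27:16 +0000 2015 2891325562 this album is wonderful, "
    "i'm so proud of you, i loved this album, it really is the best. -273",
    "Fri Nov 13 20:27:20 +0000 2015 3168571911 get weather updates from "
    "the weather channel. 15:27:19",
]

texts = [extract_text(line) for line in sample]
for text in texts:
    print(text)
```

Because the split stops after seven separators, digits and punctuation inside the tweet text are left untouched.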
You can do it without a regular expression:
import glob

for filename in glob.glob('file.txt'):
    with open("plain text - preprocesshurricane.txt", 'a') as outfile, open(filename, 'r') as infile:
        for tweet in infile:
            # Drop the trailing newline so print() does not emit blank lines.
            temp = tweet.rstrip('\n').split(' ')
            # Fields 0-6 are the timestamp and tweet ID; the rest is the text.
            print(' '.join(temp[7:]))
I propose a slightly more specific pattern than @Rushy Panchal's to avoid issues when tweets include digits: .+ \+(\d+ ){3}
Use the re.sub function:
>>> import re
>>> with open('your_file.txt', 'r') as file:
...     data = file.read()
...     print(re.sub(r'.+ \+(\d+ ){3}', '', data))
Output
rt we're treating one of you lads to this d'struct denim shirt! simply follow & rt to enter
this album is wonderful, i'm so proud of you, i loved this album, it really is the best. -273
international break is garbage smh. it's boring and your players get injured
get weather updates from the weather channel. 15:27:19
woah what happened to twitter this update is horrible
i've completed the daily quest in paradise island 2!
new post: henderson memorial public library
who's going to next week?
why so blue? @ golden bee
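One possible edge case: because .+ is greedy, a tweet whose text itself contains something like " +1 800 555 " could have part of its text stripped as well. A stricter variant, which is my own assumption about the line layout rather than anything from the answers above, anchors the pattern to the start of each line and spells out every prefix field:

```python
import re

# Assumed layout: weekday, month, day, time, UTC offset, year, tweet ID, text.
PREFIX = re.compile(
    r"^\w{3} \w{3} \d{1,2} \d{2}:\d{2}:\d{2} \+\d{4} \d{4} \d+ ",
    re.MULTILINE,  # let ^ match at the start of every line, not just the first
)

data = (
    "Fri Nov 13 20:27:20 +0000 2015 3168571911 get weather updates from "
    "the weather channel. 15:27:19\n"
    "Fri Nov 13 20:27:19 +0000 2015 229544082 i've completed the daily "
    "quest in paradise island 2!\n"
)

# Remove only the leading timestamp and tweet ID from each line.
cleaned = PREFIX.sub("", data)
print(cleaned)
```

Anchoring with ^ means the match can only consume the timestamp and ID at the start of a line, so digits later in the tweet are never touched.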