How to fetch a substring from text file in python?

Tags: python, string, text

I have a bunch of tweets in plain-text form, as shown below. I want to extract only the text part of each tweet.

SAMPLE DATA IN FILE -

Fri Nov 13 20:27:16 +0000 2015 4181010297 rt     we're treating one of you lads to this d'struct denim shirt! simply follow &amp; rt to enter
Fri Nov 13 20:27:16 +0000 2015 2891325562 this album is wonderful, i'm so proud of you, i loved this album, it really is the best.    -273
Fri Nov 13 20:27:19 +0000 2015 2347993701 international break is garbage smh. it's boring and your players get injured
Fri Nov 13 20:27:20 +0000 2015 3168571911 get weather updates from the weather channel. 15:27:19
Fri Nov 13 20:27:20 +0000 2015 2495101558 woah what happened to twitter this update is horrible
Fri Nov 13 20:27:19 +0000 2015 229544082 i've completed the daily quest in paradise island 2!
Fri Nov 13 20:27:17 +0000 2015 309233999 new post: henderson memorial public library
Fri Nov 13 20:27:21 +0000 2015 291806707 who's going to  next week?
Fri Nov 13 20:27:19 +0000 2015 3031745900 why so blue?    @ golden bee

This is my attempt at the preprocess stage -

for filename in glob.glob('*.txt'):
    with open("plain text - preprocesshurricane.txt",'a') as outfile ,open(filename, 'r') as infile:
        for tweet in infile.readlines():
            temp=tweet.split(' ')
            text=""
            for i in temp:
                x=str(i)
                if x.isalpha() :
                    text += x + ' '
            print(text)

OUTPUT-

Fri Nov rt treating one of you lads to this denim simply follow rt to 
Fri Nov this album is so proud of i loved this it really is the 
Fri Nov international break is garbage boring and your players get 
Fri Nov get weather updates from the weather 
Fri Nov woah what happened to twitter this update is 
Fri Nov completed the daily quest in paradise island 
Fri Nov new henderson memorial public 
Fri Nov going to next 
Fri Nov why so golden

This output is not the desired output because

1. It will not let me fetch numbers/digits within the text part of the tweet.
2. Every line starts with FRI NOV.

Could you please suggest a better method to achieve the same? I am not too familiar with regex, but I assume we could employ re.search(r'2015(magic to remove tweetID)/w*',tweet)

asked Apr 25 '16 20:04

Ic3fr0g

3 Answers

You can avoid regular expressions in this case. The lines of the text you've presented are consistent in terms of how many spaces go before the tweet text. Just split():

>>> data = """
   lines with tweets here
"""
>>> for line in data.splitlines():
...     print(line.split(" ", 7)[-1])
... 
rt     we're treating one of you lads to this d'struct denim shirt! simply follow &amp; rt to enter
this album is wonderful, i'm so proud of you, i loved this album, it really is the best.    -273
international break is garbage smh. it's boring and your players get injured
get weather updates from the weather channel. 15:27:19
woah what happened to twitter this update is horrible
i've completed the daily quest in paradise island 2!
new post: henderson memorial public library
who's going to  next week?
why so blue?    @ golden bee
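
The same split can be wrapped in a small helper. This is only a minimal sketch, assuming every line carries exactly seven space-separated fields (six for the timestamp, one for the tweet ID) before the text; `extract_text` and `prefix_fields` are hypothetical names that do not appear in the question.

```python
def extract_text(line, prefix_fields=7):
    """Return everything after the first `prefix_fields` space-separated fields."""
    # maxsplit keeps the tweet text (including its internal spacing) in one piece.
    parts = line.rstrip("\n").split(" ", prefix_fields)
    # A malformed line with too few fields has no text part.
    if len(parts) <= prefix_fields:
        return ""
    return parts[prefix_fields]


sample = "Fri Nov 13 20:27:19 +0000 2015 229544082 i've completed the daily quest in paradise island 2!"
print(extract_text(sample))
```

Because the split stops after seven separators, digits and punctuation inside the tweet text are left untouched.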

answered Oct 19 '22 23:10

alecxe

You can do it without a regular expression:

import glob

for filename in glob.glob('file.txt'):
    with open(filename, 'r') as infile:
        for tweet in infile:
            # The first 7 space-separated fields are the timestamp and the tweet ID.
            temp = tweet.rstrip('\n').split(' ')
            print(' '.join(temp[7:]))
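
To write the results to the output file named in the question instead of printing them, something along these lines might work. It is a sketch under assumptions: `strip_prefixes` is a hypothetical helper, and the check that skips the output file is an addition of mine, needed because a pattern such as `*.txt` would otherwise also match the output file.

```python
import glob
import os

OUTPUT_NAME = "plain text - preprocesshurricane.txt"  # output name used in the question


def strip_prefixes(pattern="*.txt", output_name=OUTPUT_NAME):
    """Append the text part of every tweet line in the matching files to output_name."""
    with open(output_name, "a") as outfile:
        for filename in sorted(glob.glob(pattern)):
            # Skip the output file itself so it is never read while being written.
            if os.path.abspath(filename) == os.path.abspath(output_name):
                continue
            with open(filename, "r") as infile:
                for tweet in infile:
                    fields = tweet.rstrip("\n").split(" ")
                    # Drop the 6 timestamp fields and the tweet ID, keep the rest.
                    outfile.write(" ".join(fields[7:]) + "\n")
```

Calling `strip_prefixes()` in the directory that holds the tweet files would then append one cleaned line per tweet to the output file.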

answered Oct 20 '22 01:10

danidee

I propose a slightly more specific pattern than @Rushy Panchal's, to avoid issues when tweets include digits: .+ \+(\d+ ){3}

Use the re.sub function:

>>> import re
>>> with open('your_file.txt','r') as file:
...     data = file.read()
...     print(re.sub(r'.+ \+(\d+ ){3}', '', data))

Output

rt     we're treating one of you lads to this d'struct denim shirt! simply follow &amp; rt to enter
this album is wonderful, i'm so proud of you, i loved this album, it really is the best.    -273
international break is garbage smh. it's boring and your players get injured
get weather updates from the weather channel. 15:27:19
woah what happened to twitter this update is horrible
i've completed the daily quest in paradise island 2!
new post: henderson memorial public library
who's going to  next week?
why so blue?    @ golden bee
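
One caveat: the leading `.+` is greedy, so a tweet whose own text contains a `+` followed by three space-separated numbers could be trimmed too far. A stricter variant, anchored to the start of each line, is sketched below. The exact field widths are an assumption read off the sample data, and `PREFIX` is a hypothetical name.

```python
import re

# Day name, month name, day, HH:MM:SS, UTC offset, year, tweet ID, then one space.
PREFIX = re.compile(
    r"^\w{3} \w{3} \d{1,2} \d{2}:\d{2}:\d{2} [+-]\d{4} \d{4} \d+ ",
    re.MULTILINE,  # let ^ match at the start of every line, not only the first
)

data = (
    "Fri Nov 13 20:27:20 +0000 2015 3168571911 get weather updates from the weather channel. 15:27:19\n"
    "Fri Nov 13 20:27:19 +0000 2015 229544082 i've completed the daily quest in paradise island 2!\n"
)
cleaned = PREFIX.sub("", data)
print(cleaned)
```

Only the prefix at the start of each line is removed, so times and numbers that appear later in a tweet stay intact.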

answered Oct 20 '22 01:10

cromod
