I have a text document containing 32 articles, and I want to extract each article's date. I have observed that the date appears on the fifth line of each article. So far I have split the text into the 32 articles using:
import re

sections = []
current = []
with open("Aberdeen2005.txt") as f:
    for line in f:
        if re.search(r"(?i)\d+ of \d+ DOCUMENTS", line):
            sections.append("".join(current))
            current = [line]
        else:
            current.append(line)

print(len(sections))
I would like to create a list that contains the date for each article, MONTH and YEAR only:
The date comes in the format shown in the picture above, but sometimes the weekday (e.g. Thursday) is not included.
Any ideas?
Kind regards,
Andres
P.S. Here is another example, from the 16th document:
Using a regex underneath the if statement, you could strip the weekday:
regx = re.compile(r'(\w+\s\d{1,2},\s\d{4})\s\w{6,9}')
line = re.sub(regx, "\\1", line)
Example:
https://regex101.com/r/pJ0nZ8/1
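To make the substitution concrete, here is a minimal sketch of the same pattern applied to a single line. The sample date strings are my assumption about what the fifth line looks like (the pictures from the question are not reproduced here), and `strip_weekday` is just an illustrative helper name.

```python
import re

# Same pattern as above: "Month D, YYYY" followed by a 6-9 letter weekday.
WEEKDAY_RE = re.compile(r"(\w+\s\d{1,2},\s\d{4})\s\w{6,9}")

def strip_weekday(line):
    """Drop a trailing weekday after a 'Month D, YYYY' date, if one is present."""
    return WEEKDAY_RE.sub(r"\1", line)

# Assumed sample lines, with and without the weekday.
print(strip_weekday("January 5, 2005 Thursday"))
print(strip_weekday("January 5, 2005"))
```

Lines that carry no weekday do not match the pattern, so they pass through unchanged.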
linecache method:
Using the linecache module, you can capture line 5 of the file specifically and write it to a file; if the date includes the weekday, the weekday is removed. It's possible to do a lot more with this functionality, although I'll leave the finer details up to you.
import linecache

w = 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'
l = linecache.getline("Aberdeen2005.txt", 5)  # line 5 of the file
m = [d in l for d in w]  # True at the index of each weekday found in the line
c = '2005', '2016'  # years (optional)

if any(y in l for y in c):  # check for years (optional)
    if any(x in l for x in w):
        # index of the first weekday present, then strip it from the line
        r = [i for i, v in enumerate(m, 0) if v]
        l = l.replace(' ' + w[r[0]], '')
    with open("dates.txt", "a") as article_dates:
        article_dates.write(l)

linecache.clearcache()
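Since the goal is a list with MONTH and YEAR for every article, one possible sketch is to read the fifth line of each string in `sections` rather than line 5 of the whole file. The names `month_year` and `dates_for` are hypothetical helpers, and the accepted date shapes ("January 5, 2005 Thursday", "January 5, 2005", "January 2005") are assumptions on my part. Whether the date really sits on line 5 of a section depends on how the split counts the "N of M DOCUMENTS" header, so `line_no` is left adjustable.

```python
import re

# Assumed shapes: "January 5, 2005 Thursday", "January 5, 2005", "January 2005".
DATE_RE = re.compile(r"([A-Za-z]+)\s+(?:\d{1,2},\s*)?(\d{4})")

def month_year(section, line_no=5):
    """Return 'Month YYYY' from the given 1-based line of a section, or None."""
    lines = section.splitlines()
    if len(lines) < line_no:
        return None  # section too short to contain the date line
    match = DATE_RE.search(lines[line_no - 1])
    if match is None:
        return None  # no recognisable date on that line
    return f"{match.group(1)} {match.group(2)}"

def dates_for(sections, line_no=5):
    """Apply month_year to every section, keeping one entry per section."""
    return [month_year(s, line_no) for s in sections]
```

Calling `dates_for(sections)` on the list built in the question would give one entry per section, with `None` marking any section where no date was found on that line.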