
Date list in text

Tags:

python

I have a text document with 32 articles in it and I want to extract each article's date. I have noticed that the date appears on the 5th row of each article. So far I have split the text into the 32 articles using:

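import re

sections = []
current = []
with open("Aberdeen2005.txt") as f:
    for line in f:
        # A header such as "3 of 32 DOCUMENTS" starts a new article.
        if re.search(r"(?i)\d+ of \d+ DOCUMENTS", line):
            if "".join(current).strip():  # skip blank text before the first header
                sections.append("".join(current))
            current = [line]
        else:
            current.append(line)

# Flush the last article, which has no following header to trigger the append.
sections.append("".join(current))

print(len(sections))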

I would like to create a list that contains the date of each article, month and year only: [image: example of the date format]

As can be seen, the date appears in the format shown in the picture above, but sometimes the weekday (e.g. Thursday) is not included.

Any ideas?

Kind regards,

Andres

P.S. Here is another example, from the 16th document: [image: second example of the date format]

Economist_Ayahuasca asked Oct 30 '22 10:10


1 Answer

Using a regex underneath the if statement, you could strip the weekday that follows the date:

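# Plain raw string: the ur'' prefix is Python 2 only and is a syntax error on Python 3.
regx = re.compile(r'(\w+\s\d{1,2},\s\d{4})\s\w{6,9}')
# Keep group 1 (the date) and drop the 6-9 letter weekday after it.
line = regx.sub(r"\1", line)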

Example:

https://regex101.com/r/pJ0nZ8/1
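
To see what the substitution does, here is a minimal sketch run on two made-up date lines. The sample strings and the helper name strip_weekday are assumptions for illustration, based on the format the pattern targets; they are not taken from the actual file.

```python
import re

# Same pattern as in the answer, written as a plain raw string so it runs on Python 3.
weekday_after_date = re.compile(r'(\w+\s\d{1,2},\s\d{4})\s\w{6,9}')


def strip_weekday(line):
    """Return line with a weekday that follows 'Month D, YYYY' removed."""
    return weekday_after_date.sub(r'\1', line)


# Hypothetical sample lines, not taken from Aberdeen2005.txt.
print(strip_weekday("October 20, 2005 Thursday"))  # October 20, 2005
print(strip_weekday("October 20, 2005"))           # October 20, 2005 (unchanged)
```

Lines that do not contain a weekday after the date pass through untouched, so the substitution is safe to apply to every line.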

linecache method:

Using the linecache module you can specifically capture line 5 of the file and write it to a file; if the date includes the weekday, the weekday will be stripped. It's possible to do a lot more with this functionality, although I'll leave the finer details up to you.

import linecache

weekdays = ('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday')
years = ('2005', '2016')  # years to accept (optional)

# linecache line numbers are 1-based, so this is the 5th line of the file.
date_line = linecache.getline("Aberdeen2005.txt", 5)

if any(year in date_line for year in years):  # check for years (optional)

    # Drop the first weekday found, together with the space in front of it.
    for day in weekdays:
        if day in date_line:
            date_line = date_line.replace(' ' + day, '')
            break

    with open("dates.txt", "a") as article_dates:
        article_dates.write(date_line)

linecache.clearcache()
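
The question asks for month and year only, for every article rather than just the first one. A rough sketch of that, assuming each article is already a string in a list (like sections in the question) and that the date sits on its 5th row in a form such as "October 20, 2005 Thursday" or "October 20, 2005". The names month_year_list, MONTH_YEAR and the sample texts are hypothetical; adjust row if blank lines shift the date in your file.

```python
import re

MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")
# Month name, an optional day number with comma, then a four-digit year.
MONTH_YEAR = re.compile(r'\b(' + MONTHS + r')\b(?:\s+\d{1,2},?)?\s+(\d{4})\b')


def month_year_list(sections, row=5):
    """Collect 'Month YYYY' from the given 1-based row of each section."""
    dates = []
    for section in sections:
        lines = section.splitlines()
        found = MONTH_YEAR.search(lines[row - 1]) if len(lines) >= row else None
        # None marks an article whose row is missing or holds no date.
        dates.append(f"{found.group(1)} {found.group(2)}" if found else None)
    return dates


# Hypothetical articles; the real layout of Aberdeen2005.txt may differ.
sample = [
    "1 of 32 DOCUMENTS\n\nSome Newspaper\n\nOctober 20, 2005 Thursday\n\nHeadline\n",
    "2 of 32 DOCUMENTS\n\nSome Newspaper\n\nMarch 3, 2005\n\nHeadline\n",
]
print(month_year_list(sample))
```

Because the pattern only looks for a month name and a year, it does not matter whether the weekday is present on the line.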
l'L'l answered Nov 15 '22 03:11