Split diary file into multiple files using Python

Question

I keep a diary file of tech notes. Each entry is timestamped like so:

# Monday 02012-05-07 at 01:45:20 PM

This is a sample note

Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod
tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam,
quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo
consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse
cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non
proident, sunt in culpa qui officia deserunt mollit anim id est laborum.

# Wednesday 02012-06-06 at 03:44:11 PM

Here is another one.

Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia 
deserunt mollit anim id est laborum.

I would like to break these notes down into individual files based on the timestamp headers, e.g. This is a sample note.txt, Here is another really long title.txt. I'm sure I would have to truncate the filename at some point, but the idea is to seed the filename from the first line of the diary entry.

It doesn't look like I can modify the file's creation date via Python, so I would like to preserve the entry's timestamp as part of the note's body.

I've got a RegEx pattern to capture the timestamps that suits me well:

#(\s)(Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday)(\s)(.*)

I can likely use that regex to break each entry down, but I'm not quite sure how to loop through the diary file and write it out as individual files. There are a lot of examples of grabbing the actual regex pattern, or a particular line, but I want to do a few more things here and am having some difficulty piecing it together.

Here is an example of the desired file contents (datestamp + all text up until next datestamp match):

bash$ cat This\ is\ a\ sample\ note.txt
Monday 02012-05-07 at 01:45:20 PM

This is a sample note

Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod
tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam,
quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo
consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse
cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non
proident, sunt in culpa qui officia deserunt mollit anim id est laborum.

bash$

Tim Peters · Accepted Answer

Here's the general ;-) approach:

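with open("diaryfile", "r") as f:
    body = []  # lines of the entry currently being collected
    for line in f:
        if your_regexp.match(line):
            # A new timestamp header: flush the previous entry, if any.
            if body:
                write_one(body)
            body = []
        body.append(line)
    # The last entry has no following header, so flush it here.
    if body:
        write_one(body)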

In short, you just keep appending all lines to a list (body). When you find a magical line, you call write_one() to dump what you have so far, and clear the list. The last chunk of the file is a special case, because you're not going to find your magical regexp again. So you again dump what you have after the loop.
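As a self-contained sketch of that loop, here is one way it might be packaged as a generator, so the splitting can be tried without touching the disk. The names HEADER_RE and split_entries are made up for illustration, and the pattern is simply the one from the question; adjust both to taste.

```python
import re

# The pattern from the question, compiled once. match() only matches at the
# start of a line, so a "#" in the middle of a note is not treated as a header.
HEADER_RE = re.compile(
    r"#(\s)(Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday)(\s)(.*)"
)


def split_entries(lines, header_re=HEADER_RE):
    """Yield each diary entry as a list of lines, header line first."""
    body = []
    for line in lines:
        if header_re.match(line) and body:
            # A new header: hand back what was collected and start over.
            yield body
            body = []
        body.append(line)
    # The final entry has no header after it, so emit it here.
    if body:
        yield body


sample = [
    "# Monday 02012-05-07 at 01:45:20 PM\n",
    "\n",
    "This is a sample note\n",
    "# Wednesday 02012-06-06 at 03:44:11 PM\n",
    "\n",
    "Here is another one.\n",
]
entries = list(split_entries(sample))
```

Any lines that appear before the first timestamp end up at the front of the first chunk, so you may want to handle that case separately if your file has a preamble.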

You can make any transformations you like in your write_one() function. For example, sounds like you want to remove the leading "# " from the input timestamp lines. That's fine - just do, e.g.,

body[0] = body[0][2:]

in write_one. All the lines can be written out in one gulp via, e.g.,

with open(file_name_extracted_from_body_goes_here, "w") as f:
    f.writelines(body)
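Putting those pieces together, write_one might look something like the sketch below. It assumes the title is the first non-blank line after the timestamp and that 50 characters is a reasonable cutoff; both are guesses, as are the helper name title_to_filename and the set of characters it replaces.

```python
import re
from pathlib import Path


def title_to_filename(title, max_len=50):
    """Turn an entry's title line into a filesystem-friendly .txt name."""
    # Replace characters that are not allowed in file names on common systems.
    cleaned = re.sub(r'[\\/:*?"<>|]', "_", title.strip())
    # Truncate, then drop trailing spaces and dots left by the cut.
    cleaned = cleaned[:max_len].rstrip(" .")
    return (cleaned or "untitled") + ".txt"


def write_one(body, out_dir="."):
    """Write one entry (a list of lines, header first) to its own file."""
    lines = list(body)  # copy so the caller's list is left alone
    if lines[0].startswith("# "):
        lines[0] = lines[0][2:]  # drop the leading "# " from the timestamp
    # Assumption: the title is the first non-blank line after the timestamp.
    title = next((ln for ln in lines[1:] if ln.strip()), "")
    path = Path(out_dir) / title_to_filename(title)
    with open(path, "w") as f:
        f.writelines(lines)
    return path
```

Note that this version overwrites an existing file with the same name, so two entries with the same first line will collide.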

You probably want to check that the file doesn't exist first! If it's anything like my diary, the first line of many entries will be "Rotten day." ;-)
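One possible way to guard against that, sketched here with a hypothetical helper named open_new: open the file in "x" mode, which refuses to overwrite, and fall back to a numbered name when the first choice is taken. The " (2)" suffix style is just an assumption about what you would want.

```python
from pathlib import Path


def open_new(path):
    """Open a fresh file for writing without clobbering an existing one."""
    path = Path(path)
    candidate = path
    n = 1
    while True:
        try:
            # Mode "x" raises FileExistsError instead of truncating the file.
            return open(candidate, "x")
        except FileExistsError:
            # Name taken: try "Title (2).txt", then "Title (3).txt", and so on.
            n += 1
            candidate = path.with_name(f"{path.stem} ({n}){path.suffix}")
```

Using it in place of open(..., "w") inside write_one would turn a second "Rotten day.txt" into "Rotten day (2).txt" rather than silently replacing the first.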

Tags: python, regex, text, file-io

Ben Keating

1 Answer

Tim Peters
