I'm trying to parse a newline-delimited text file into blocks of lines, which are appended to a .txt file. I'd like to be able to grab a fixed number of lines AFTER my end string. Those lines vary in content, so I can't choose an 'end string' that matches them without missing lines.
Example of file:
"Start"
"..."
"..."
"..."
"..."
"---" ##End here
"xxx" ##Unique data here
"xxx" ##And here
And here's the code:
first = "Start"
first_end = "---"
with open('testlog.log') as infile, open('parsed.txt', 'a') as outfile:
    copy = False
    for line in infile:
        if line.strip().startswith(first):
            copy = True
            outfile.write(line)
        elif line.strip().startswith(first_end):
            copy = False
            outfile.write(line)
            ##Want to also write next 2 lines here
        elif copy:
            outfile.write(line)
Is there any way to do this using for line in infile, or do I need to use a different type of loop?
You can use next (or, in Python 3 and up, readline) to retrieve the next line in the file from inside the loop:
        elif line.strip().startswith(first_end):
            copy = False
            outfile.write(line)
            outfile.write(next(infile))
            outfile.write(next(infile))
or
        #note: not compatible with Python 2.7 and below
        elif line.strip().startswith(first_end):
            copy = False
            outfile.write(line)
            outfile.write(infile.readline())
            outfile.write(infile.readline())
Either way, the file's position advances by two additional lines, so the next iteration of for line in infile: resumes after the two lines you just read instead of processing them again.
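One caveat with next(infile): it raises StopIteration if the file ends before both extra lines are available. A minimal sketch of one way around that, assuming you would rather write whatever lines remain, is to pull the trailing lines with itertools.islice. The function name copy_blocks and its trailing parameter are made up for illustration, and io.StringIO stands in for the real files.

```python
import io
from itertools import islice


def copy_blocks(infile, outfile, start="Start", end="---", trailing=2):
    """Copy each start..end block plus `trailing` lines after the end marker."""
    copy = False
    for line in infile:
        stripped = line.strip()
        if stripped.startswith(start):
            copy = True
            outfile.write(line)
        elif stripped.startswith(end):
            copy = False
            outfile.write(line)
            # islice pulls up to `trailing` more lines from the same iterator
            # and simply stops early at end of file instead of raising.
            outfile.writelines(islice(infile, trailing))
        elif copy:
            outfile.write(line)


# In-memory demonstration; real code would pass open file objects instead.
source = io.StringIO("noise\nStart\na\n---\nx\ny\nskipped\nStart\nb\n---\nz\n")
result = io.StringIO()
copy_blocks(source, result)
print(result.getvalue())
```

The second block in the sample input has only one line after its end marker, and the loop finishes cleanly instead of raising.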
Bonus terminology nitpick: a file object is not a list, and methods for accessing the x+1th element of a list might not work for accessing the next line of a file, and vice versa. If you did want to access the next item of a proper list object, you could use enumerate so you can perform arithmetic on the list's index. For example:
seq = ["foo", "bar", "baz", "qux", "troz", "zort"]
#find all instances of "baz" and also the first two elements after "baz"
for idx, item in enumerate(seq):
    if item == "baz":
        print(item)
        print(seq[idx+1])
        print(seq[idx+2])
Note that, unlike readline, indexing will not advance the iterator, so for idx, item in enumerate(seq): will still iterate over "qux" and "troz".
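Indexing with seq[idx+1] raises IndexError when the match sits within the last two elements of the list. A small sketch of a variant that uses slicing instead, which just returns fewer items near the end; the helper name item_and_following is hypothetical.

```python
def item_and_following(items, target, count=2):
    """Return each occurrence of `target` together with up to `count` items after it."""
    groups = []
    for idx, item in enumerate(items):
        if item == target:
            # A slice past the end of the list is truncated rather than raising.
            groups.append(items[idx:idx + count + 1])
    return groups


seq = ["foo", "bar", "baz", "qux", "baz"]
groups = item_and_following(seq, "baz")
print(groups)
```

As with plain indexing, the slice does not advance the loop, so the second "baz" appears both inside the first group and as a match of its own.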
An approach that works on any iterable is to use an additional variable to keep track of state across iterations. The advantage of this is that you don't have to know anything about how to manually advance iterables; the disadvantage is that the logic within the loop is harder to reason about, because each iteration now depends on extra state carried over from earlier ones.
first = "Start"
first_end = "---"
with open('testlog.log') as infile, open('parsed.txt', 'a') as outfile:
    copy = False
    num_items_to_write = 0
    for line in infile:
        if num_items_to_write > 0:
            outfile.write(line)
            num_items_to_write -= 1
        elif line.strip().startswith(first):
            copy = True
            outfile.write(line)
        elif line.strip().startswith(first_end):
            copy = False
            outfile.write(line)
            num_items_to_write = 2
        elif copy:
            outfile.write(line)
In the specific case of pulling repetitive groups of data out of a delimited file, it might be appropriate to skip iteration entirely and use regex instead. For data like yours, that might look like:
import re
with open("testlog.log") as file:
    data = file.read()
pattern = re.compile(r"""
^Start$            #"Start" by itself on a line
(?:\n.*$)*?        #zero or more lines, matched non-greedily
                   #use (?:) for all groups so `findall` doesn't capture them later
\n---$             #"---" by itself on a line
(?:\n.*$){2}       #exactly two lines
""", re.MULTILINE | re.VERBOSE)
#equivalent one-line regex:
#pattern = re.compile("^Start$(?:\n.*$)*?\n---$(?:\n.*$){2}", re.MULTILINE)
for group in pattern.findall(data):
    print("Found group:")
    print(group)
    print("End of group.\n\n")
When run on a log that looks like:
Start
foo
bar
baz
qux
---
troz
zort
alice
bob
carol
dave
Start
Fred
Barney
---
Wilma
Betty
Pebbles
...this will produce the output:
Found group:
Start
foo
bar
baz
qux
---
troz
zort
End of group.
Found group:
Start
Fred
Barney
---
Wilma
Betty
End of group.
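If the markers or the number of trailing lines need to change, the same pattern can be assembled from parameters. This is only a sketch under that assumption: find_blocks and its arguments are illustrative names, and re.escape is used so that markers containing regex metacharacters are matched literally.

```python
import re


def find_blocks(text, start="Start", end="---", trailing=2):
    """Return every start..end block plus exactly `trailing` lines after it."""
    pattern = re.compile(
        "".join([
            "^", re.escape(start), "$",    # start marker alone on a line
            r"(?:\n.*$)*?",                # zero or more lines, non-greedy
            r"\n", re.escape(end), "$",    # end marker alone on a line
            r"(?:\n.*$){%d}" % trailing,   # exactly `trailing` more lines
        ]),
        re.MULTILINE,
    )
    return pattern.findall(text)


sample = "Start\nfoo\n---\na\nb\nc\nStart\nx\n---\ny\nz\n"
blocks = find_blocks(sample)
for block in blocks:
    print(block)
    print("End of group.")
```

Each returned string could be written to the output file followed by a newline. Note that a Start with no matching end marker would run on into the next block, so this relies on the log being well formed.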