I have ~200 short text files (50 KB) that all have a similar format. I want to find the line in each of those files that contains a certain string and then write that line plus the next three lines (but not the rest of the lines in the file) to another text file. I am trying to teach myself Python in order to do this and have written a very simple, crude script to try it out. I am using version 2.6.5 and running the script from the Mac terminal:
#!/usr/bin/env python
f = open('Test.txt')
Lines=f.readlines()
searchquery = 'am\n'
i=0
while i < 500:
    if Lines[i] == searchquery:
        print Lines[i:i+3]
        i = i+1
    else:
        i = i+1
f.close()
This more or less works and prints the output to the screen. But I would like to print the lines to a new file instead, so I tried something like this:
f1 = open('Test.txt')
f2 = open('Output.txt', 'a')
Lines=f1.readlines()
searchquery = 'am\n'
i=0
while i < 500:
if Lines[i] == searchquery:
f2.write(Lines[i])
f2.write(Lines[i+1])
f2.write(Lines[i+2])
i = i+1
else:
i = i+1
f1.close()
f2.close()
However, nothing is written to the file. I also tried
from __future__ import print_function
print(Lines[i], file='Output.txt')
and can't get that to work, either. If anyone can explain what I'm doing wrong or offer some suggestions about what I should try instead I would be really grateful. Also, if you have any suggestions for making the search better I would appreciate those as well. I have been using a test file where the string I want to find is the only text on the line, but in my real files the string that I need is still at the beginning of the line but followed by a bunch of other text, so I think the way I have things set up now won't really work, either.
Thanks, and sorry if this is a super basic question!
As pointed out by @ajon, I don't think there's anything fundamentally wrong with your code except the indentation. With the indentation fixed, it works for me. However, there are a couple of opportunities for improvement.
1) In Python, the standard way of iterating over things is a for loop. With a for loop, you don't need to define loop counter variables and keep track of them yourself in order to iterate over things. Instead, you write something like this
for line in lines:
    print line
to iterate over all the items in a list of strings and print them.
2) In most cases this is what your for loops will look like. However, there are situations where you actually do want to keep track of the loop count. Your case is such a situation, because you need not only that one line but also the next three, and therefore need the counter for indexing (lst[i]). For that there's enumerate(), which yields (index, item) pairs that you can loop over.
for i, line in enumerate(lines):
    print i
    print line
    # Look ahead only when a next line exists, to avoid an IndexError.
    if i + 1 < len(lines):
        print lines[i + 1]
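To make the shape of what enumerate() produces concrete, here is a small sketch. The sample list is made up for illustration, and the code is written to run on both Python 2.6 and Python 3.

```python
lines = ['first\n', 'second\n', 'third\n']

# enumerate() hands back (index, item) pairs one at a time;
# list() collects them so they can be inspected.
pairs = list(enumerate(lines))

# Pair each line with the line after it, skipping the last line,
# which has no successor.
following = {}
for i, line in enumerate(lines):
    if i + 1 < len(lines):
        following[line] = lines[i + 1]
```

Any look-ahead such as lines[i + 1] needs a check like this, because the index runs all the way to the last item.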
If you were to keep track of the loop counter manually, as in your example, there are two things to change:
3) The i = i+1 should be moved out of the if and else blocks. You're doing it in both cases, so put it after the if/else. The else block then doesn't do anything any more and can be eliminated:
while i < 500:
    if Lines[i] == searchquery:
        f2.write(Lines[i])
        f2.write(Lines[i+1])
        f2.write(Lines[i+2])
    i = i+1
4) Now, this will cause an IndexError with files shorter than 500 lines (and silently ignore everything past line 500 in longer files). Instead of hard-coding a loop count of 500, you should use the actual length of the sequence you're iterating over. len(lines) will give you that length. But instead of using a while loop, use a for loop and range(len(lst)) to iterate over the indices from zero to len(lst) - 1.
for i in range(len(lst)):
    print lst[i]
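The difference between the two loop bounds can be seen in a short sketch. The three-line list stands in for a file shorter than 500 lines, and the variable names are only illustrative.

```python
lines = ['x\n', 'am\n', 'y\n']  # far fewer than 500 lines

# Hard-coded bound: the loop runs off the end of the list.
hit_index_error = False
i = 0
try:
    while i < 500:
        lines[i]  # raises IndexError once i reaches len(lines)
        i = i + 1
except IndexError:
    hit_index_error = True

# Bound taken from the list itself: every index is visited exactly once.
visited = []
for i in range(len(lines)):
    visited.append(i)
```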
5) open() can be used as a context manager that takes care of closing files for you. Context managers are a rather advanced concept, but they are simple to use when they're already provided for you. By doing something like this
with open('test.txt', 'w') as f:
    f.write('foo')
the file will be opened and accessible to you as f inside that with block. After you leave the block, the file is closed automatically, so you can't forget to close it.
In your case you're opening two files. This can be done by nesting two with statements
with open('one.txt', 'w') as f1:
    with open('two.txt', 'w') as f2:
        f1.write('foo')
        f2.write('bar')
or, in Python 2.7 and Python 3.1+, by listing both context managers in a single with statement:
with open('one.txt', 'w') as f1, open('two.txt', 'a') as f2:
    f1.write('foo')
    f2.write('bar')
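As a minimal sketch of the read-one-file, append-to-another pattern, the following uses the nested form so that it also runs on Python 2.6. It works in a temporary directory, and the file names and contents are invented for the demonstration.

```python
import os
import tempfile

workdir = tempfile.mkdtemp()
src_path = os.path.join(workdir, 'one.txt')
dst_path = os.path.join(workdir, 'two.txt')

# Create a small input file to read from.
with open(src_path, 'w') as f:
    f.write('foo\nbar\n')

# Read from one file and append to the other.
with open(src_path) as src:
    with open(dst_path, 'a') as dst:
        for line in src:
            dst.write(line)

# Both files were closed when their with blocks ended.
closed_after = (src.closed, dst.closed)

with open(dst_path) as f:
    copied = f.read()
```

The mode matters here: a file opened without a mode is read-only, so anything you want to write to needs 'w' or 'a'.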
6) Line endings differ depending on the operating system the file was created on. UNIX-like platforms use \n, Macs before OS X used \r, and Windows uses \r\n. So Lines[i] == searchquery will not match for old Mac or Windows line endings: file.readlines() keeps whatever line ending was at the end of each line, so the comparison with 'am\n' fails. (In Python 2, opening the file with mode 'rU' enables universal newlines, which presents all three endings as \n.) This is solved by using str.strip(), which strips all whitespace from the beginning and the end of the string, and comparing the result to a search pattern without the line ending:
searchquery = 'am'
# ...
if line.strip() == searchquery:
    # ...
(Reading the file using file.read() and splitting it with str.splitlines() would be another alternative.)
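The effect of the different line endings can be checked on plain strings, without any files involved. This is only an illustrative sketch with made-up sample strings.

```python
searchquery = 'am'
samples = ['am\n', 'am\r\n', 'am\r']  # UNIX, Windows, pre-OS X Mac

# Comparing against 'am\n' only matches the UNIX-style line.
naive = [line == 'am\n' for line in samples]

# Stripping surrounding whitespace first makes all three match.
stripped = [line.strip() == searchquery for line in samples]

# str.splitlines() splits on all three endings and drops them.
text = 'am\r\nnext\rlast\n'
split = text.splitlines()
```

Note that strip() also removes leading spaces and tabs; rstrip('\r\n') is a narrower option if leading whitespace is significant in your files.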
But since you mentioned that your search string actually appears at the beginning of the line, let's handle that by using str.startswith():
if line.startswith(searchquery):
    # ...
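One thing to keep in mind is that str.startswith() is a plain prefix test, so it also matches longer words that begin with the search string. The sketch below uses invented sample lines and shows one possible stricter check, comparing the first whitespace-separated word; that variant is a suggestion, not something from the question.

```python
searchquery = 'am'
lines = ['am I first?\n', '  am indented\n', 'I am not\n', 'amber\n']

# Prefix match: also picks up 'amber', and misses the indented line.
prefix_matches = [l for l in lines if l.startswith(searchquery)]

# Stricter alternative: the first word must equal the search string.
# split() ignores leading whitespace; [:1] is safe on empty lines.
word_matches = [l for l in lines if l.split()[:1] == [searchquery]]
```

Which of the two is appropriate depends on what the real files look like.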
7) The official style guide for Python, PEP 8, recommends CamelCase for classes and lowercase_underscore for pretty much everything else (variables, functions, attributes, methods, modules, packages). So instead of Lines use lines. This is definitely a minor point compared to the others, but still worth getting right early on.
So, considering all those things I would write your code like this:
searchquery = 'am'
with open('Test.txt') as f1:
    with open('Output.txt', 'a') as f2:
        lines = f1.readlines()
        for i, line in enumerate(lines):
            if line.startswith(searchquery):
                f2.write(line)
                f2.write(lines[i + 1])
                f2.write(lines[i + 2])
As @TomK pointed out, all this code assumes that if your search string matches, there are at least two lines following it. If you can't rely on that assumption, handling that case with a try...except block, as @poorsod suggested, is the right way to go.
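As an alternative to try...except, slicing sidesteps the problem, because a slice that runs past the end of a list simply stops early instead of raising IndexError. The sketch below is one possible way to apply this to a whole folder of files. The function name copy_matches, its extra parameter, the file names and the sample contents are all hypothetical, and the demonstration runs in a temporary directory.

```python
import glob
import os
import tempfile


def copy_matches(in_path, out_file, searchquery, extra=2):
    """Write each matching line plus up to `extra` following lines."""
    with open(in_path) as f:
        lines = f.readlines()
    count = 0
    for i, line in enumerate(lines):
        if line.startswith(searchquery):
            # A slice never raises IndexError; it just ends at the
            # last line if fewer than `extra` lines follow.
            out_file.writelines(lines[i:i + 1 + extra])
            count += 1
    return count


# Demonstration with two small made-up input files.
workdir = tempfile.mkdtemp()
with open(os.path.join(workdir, 'a.txt'), 'w') as f:
    f.write('x\nam one\n1\n2\n3\n')
with open(os.path.join(workdir, 'b.txt'), 'w') as f:
    f.write('am last\nonly\n')

# The output file uses a different extension so the glob skips it.
out_path = os.path.join(workdir, 'Output.out')
with open(out_path, 'a') as out:
    for path in sorted(glob.glob(os.path.join(workdir, '*.txt'))):
        copy_matches(path, out, 'am')

with open(out_path) as f:
    result = f.read()
```

With extra=2 this writes the same three lines per match as the code above. The question text asks for the next three lines, which would correspond to extra=3.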
I think your problem is the indentation in the second script.
You need to indent everything from if Lines[i] through the last i = i+1, such as:
while i < 500:
    if Lines[i] == searchquery:
        f2.write(Lines[i])
        f2.write(Lines[i+1])
        f2.write(Lines[i+2])
        i = i+1
    else:
        i = i+1
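On the print_function attempt from the question: the file argument of print() expects an open file object (anything with a write() method), not a file name, so passing the string 'Output.txt' fails. A minimal sketch of the working form follows. It is written in Python 3 syntax; on Python 2.6 the from __future__ import print_function line from the question would have to be the first statement of the script. The temporary path and sample lines are made up.

```python
import os
import tempfile

out_path = os.path.join(tempfile.mkdtemp(), 'Output.txt')
lines = ['am\n', 'next\n']

# Passing a file name instead of a file object does not work.
try:
    print('am', file='Output.txt')
    failed = False
except AttributeError:
    failed = True

# Passing the open file object does.
with open(out_path, 'a') as out:
    for line in lines:
        # end='' because each line already carries its own newline.
        print(line, end='', file=out)

with open(out_path) as f:
    written = f.read()
```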