
Using python to write specific lines from one file to another file

Tags:

python

I have ~200 short text files (50kb) that all have a similar format. I want to find a line in each of those files that contains a certain string and then write that line plus the next three lines (but not the rest of the lines in the file) to another text file. I am trying to teach myself Python in order to do this and have written a very simple and crude little script to try this out. I am using version 2.6.5, and running the script from the Mac terminal:

#!/usr/bin/env python

f = open('Test.txt')

Lines=f.readlines()
searchquery = 'am\n'
i=0

while i < 500:
    if Lines[i] == searchquery:
        print Lines[i:i+3]
        i = i+1
    else:
        i = i+1
f.close()

This more or less works and prints the output to the screen. But I would like to print the lines to a new file instead, so I tried something like this:

f1 = open('Test.txt')
f2 = open('Output.txt', 'a')

Lines=f1.readlines()
searchquery = 'am\n'
i=0

while i < 500:
if Lines[i] == searchquery:
    f2.write(Lines[i])
    f2.write(Lines[i+1])
    f2.write(Lines[i+2])
    i = i+1
else:
    i = i+1
f1.close()
f2.close()

However, nothing is written to the file. I also tried

from __future__ import print_function
print(Lines[i], file='Output.txt')

and can't get that to work, either. If anyone can explain what I'm doing wrong or offer some suggestions about what I should try instead I would be really grateful. Also, if you have any suggestions for making the search better I would appreciate those as well. I have been using a test file where the string I want to find is the only text on the line, but in my real files the string that I need is still at the beginning of the line but followed by a bunch of other text, so I think the way I have things set up now won't really work, either.

Thanks, and sorry if this is a super basic question!

Andreanna asked Oct 06 '12 00:10


2 Answers

As pointed out by @ajon, I don't think there's anything fundamentally wrong with your code except the indentation. With the indentation fixed it works for me. However, there are a couple of opportunities for improvement.

1) In Python, the standard way of iterating over things is by using a for loop. When using a for loop, you don't need to define loop counter variables and keep track of them yourself in order to iterate over things. Instead, you write something like this

for line in lines:
    print line

to iterate over all the items in a list of strings and print them.

2) In most cases this is what your for loops will look like. However, there are situations where you actually do want to keep track of the loop count. Your case is such a situation, because you need not only the matching line but also the lines that follow it, and therefore need the counter for indexing (lst[i]). For that there's enumerate(), which yields pairs of an index and the corresponding item, so you can loop over both at once.

for i, line in enumerate(lines):
    print i
    print line
    # Look ahead only if a following line exists, to avoid an IndexError
    if i + 1 < len(lines):
        print lines[i + 1]
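
As a small illustration of what enumerate() produces, here is a minimal sketch. The sample list and the names pairs and match_indexes are made up for this example, and the code avoids print so that it runs unchanged on Python 2.6+ and Python 3.

```python
# Hypothetical sample data standing in for the result of f.readlines()
lines = ['first\n', 'am\n', 'third\n']

# enumerate() yields (index, item) pairs; list() collects them for inspection
pairs = list(enumerate(lines))

# Collect the index of every line that equals the search string
match_indexes = [i for i, line in enumerate(lines) if line == 'am\n']
```

Here pairs holds (0, 'first\n'), (1, 'am\n') and (2, 'third\n'), and match_indexes contains only the index 1.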

If you were to manually keep track of the loop counter as in your example, there are two things to change:

3) That i = i+1 should be moved out of the if and else blocks. You're doing it in both cases, so put it after the if/else. In your case the else block then doesn't do anything any more, and can be eliminated:

while i < 500:
    if Lines[i] == searchquery:
        f2.write(Lines[i])
        f2.write(Lines[i+1])
        f2.write(Lines[i+2])
    i = i+1

4) Now, this will cause an IndexError with files shorter than 500 lines, and it will silently skip everything past line 500 in longer files. Instead of hard coding a loop count of 500, you should use the actual length of the sequence you're iterating over; len(lst) gives you that length for a list lst. And instead of a while loop, use a for loop over range(len(lst)), which iterates over the indexes from zero to len(lst) - 1.

for i in range(len(lst)):
    print lst[i]
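
To make the difference concrete, the following sketch runs the same index-based scan twice, once bounded by len(lines) and once by a hard-coded 500. The helper scan() and the three-line sample list are assumptions made up for this illustration, not part of the original script.

```python
# Hypothetical three-line "file"
lines = ['first\n', 'am\n', 'last\n']


def scan(lines, limit):
    """Index-based scan over range(limit); returns the matching indexes."""
    found = []
    for i in range(limit):
        if lines[i] == 'am\n':
            found.append(i)
    return found


# Bound taken from the data: visits indexes 0, 1 and 2 only
safe = scan(lines, len(lines))

# Hard-coded bound, as in the question: runs off the end of the list
try:
    scan(lines, 500)
    overran = False
except IndexError:
    overran = True
```

With this input, safe is [1] and overran ends up True, because lines[3] does not exist.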

5) open() can be used as a context manager that takes care of closing files for you. Context managers are a rather advanced concept, but they are pretty simple to use if they're already provided for you. By doing something like this

# 'w' opens the file for writing (the default mode is read-only)
with open('test.txt', 'w') as f:
    f.write('foo')

the file will be opened and accessible to you as f inside that with block. After you leave the block the file will be automatically closed, so you can't end up forgetting to close the file.

In your case you're opening two files. This can be done by nesting two with statements

with open('one.txt', 'w') as f1:
    with open('two.txt', 'a') as f2:
        f1.write('foo')
        f2.write('bar')

or, in Python 2.7 / Python 3.x, by combining both context managers in a single with statement:

with open('one.txt', 'w') as f1, open('two.txt', 'a') as f2:
    f1.write('foo')
    f2.write('bar')
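
A quick way to convince yourself that the with block really closes the file is to check the file object's closed attribute inside and after the block. This is only a sketch: the scratch file name example.txt and the temporary directory are assumptions chosen so the example does not touch your real files.

```python
import os
import tempfile

# Hypothetical scratch file in a fresh temporary directory
path = os.path.join(tempfile.mkdtemp(), 'example.txt')

with open(path, 'w') as f:
    f.write('foo\n')
    closed_inside = f.closed    # the file is still open inside the block

closed_after = f.closed         # closed automatically on leaving the block

# Re-open for reading to confirm the data was flushed to disk
with open(path) as f:
    content = f.read()
```

closed_inside is False and closed_after is True, and content holds the line that was written.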

6) Depending on the operating system the file was created on, line endings are different. On UNIX-like platforms it's \n, Macs before OS X used \r, and Windows uses \r\n. So Lines[i] == searchquery will not match for old Mac or Windows line endings. Opening the file in universal newline mode (mode 'rU' in Python 2) lets readlines() recognize all three, but in the default mode on a UNIX-like system whatever was in front of the \n stays at the end of the line, and the comparison fails. A robust fix is to use str.strip(), which strips all whitespace from the beginning and the end of the string, and to compare the result against a search pattern without the line ending:

searchquery = 'am'
# ...
            if line.strip() == searchquery:
                # ...

(Reading the file using file.read() and using str.splitlines() would be another alternative.)

But since you mentioned that your search string actually appears at the beginning of the line, let's check for exactly that by using str.startswith():

if line.startswith(searchquery):
    # ...
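
The two tests behave differently on real data, so here is a hedged side-by-side sketch. The sample lines are invented, and the third variant (comparing the first whitespace-separated word) is an extra suggestion of mine rather than something from the question; it may be useful if 'am' must not match words such as 'amber'.

```python
searchquery = 'am'

# Hypothetical sample lines with different endings and trailing text
samples = ['am\n', 'am\r\n', 'am followed by other text\r\n', 'amber\n', 'pm\n']

# Exact match after stripping surrounding whitespace (including \r and \n)
exact = [line.strip() == searchquery for line in samples]

# Prefix match: also accepts any line that merely begins with 'am'
prefix = [line.startswith(searchquery) for line in samples]

# Stricter prefix match: the first whitespace-separated word must equal the query
first_word = [line.split()[:1] == [searchquery] for line in samples]
```

exact accepts only the first two samples, prefix accepts the first four (including 'amber'), and first_word accepts the first three.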

7) The official style guide for Python, PEP 8, recommends CamelCase for classes and lowercase_underscore for pretty much everything else (variables, functions, attributes, methods, modules, packages). So instead of Lines, use lines. This is definitely a minor point compared to the others, but still worth getting right early on.


So, considering all those things I would write your code like this:

searchquery = 'am'

with open('Test.txt') as f1:
    with open('Output.txt', 'a') as f2:
        lines = f1.readlines()
        for i, line in enumerate(lines):
            if line.startswith(searchquery):
                f2.write(line)
                f2.write(lines[i + 1])
                f2.write(lines[i + 2])

As @TomK pointed out, all this code assumes that if your search string matches, there are at least two lines following it. If you can't rely on that assumption, dealing with that case by using a try...except block like @poorsod suggested is the right way to go.
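
One possible alternative to try...except, offered only as a sketch, is list slicing: a slice that reaches past the end of a list simply returns fewer items instead of raising an IndexError. The function name extract_blocks and its following parameter are hypothetical names introduced for this example.

```python
def extract_blocks(lines, searchquery, following=2):
    """Return each matching line plus up to `following` lines after it."""
    selected = []
    for i, line in enumerate(lines):
        if line.startswith(searchquery):
            # Slicing past the end of the list yields a shorter slice,
            # so a match near the end of the file cannot raise IndexError.
            selected.extend(lines[i:i + following + 1])
    return selected


# Hypothetical file content; the second match has only one line after it
sample = ['x\n', 'am 1\n', 'a\n', 'b\n', 'c\n', 'am 2\n', 'd\n']
result = extract_blocks(sample, 'am')
```

The returned list can then be passed to f2.writelines(). Use following=3 if you want the three lines after each match, as described in the question. Note that if a second match falls within the lines following an earlier match, those lines are collected twice.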

Lukas Graf answered Nov 16 '22 23:11


I think your problem is the indentation in the second script.

You need to indent everything from if Lines[i] through the last i = i+1 so that it sits inside the while loop, like this:

while i < 500:
    if Lines[i] == searchquery:
        f2.write(Lines[i])
        f2.write(Lines[i+1])
        f2.write(Lines[i+2])
        i = i+1
    else:
        i = i+1

ajon answered Nov 17 '22 01:11