Improving the speed of a python script

Tags:

python

I have an input file containing a list of strings.

I am iterating through every fourth line starting on line two.

From each of these lines I make a new string from the first and last 6 characters and put this in an output file only if that new string is unique.

The code I wrote to do this works, but I am working with very large deep sequencing files, and it has been running for a day without making much progress. I'm looking for any suggestions to make this much faster, if possible. Thanks.

def method():
    target = open(output_file, 'w')

    with open(input_file, 'r') as f:
        lineCharsList = []

        for line in f:
            #Make string from first and last 6 characters of a line
            lineChars = line[0:6]+line[145:151] 

            if not (lineChars in lineCharsList):
                lineCharsList.append(lineChars)

                target.write(lineChars + '\n') #If string is unique, write to output file

            for skip in range(3): #Used to step through four lines at a time
                try:
                    check = line    #Check for additional lines in file
                    next(f)
                except StopIteration:
                    break
    target.close()

The Nightman asked Jul 09 '15 02:07


2 Answers

Try defining lineCharsList as a set instead of a list:

lineCharsList = set()
...
lineCharsList.add(lineChars)

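That'll improve the performance of the in operator: a membership test on a set takes constant time on average, whereas on a list it has to scan the elements one by one. Also, if memory isn't a problem at all, you might want to accumulate all the output in a list and write it all at the end, instead of performing multiple write() operations.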
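
As a rough sketch of both suggestions together: the function name `unique_keys`, the `width` parameter, and the sample lines are made up for illustration and are not from the question. It also takes the last 6 characters with a negative slice rather than the fixed positions 145:151 used in the question, which assumes the trailing newline has been stripped first.

```python
def unique_keys(lines, width=6):
    """Return first+last `width` characters of each line, first occurrences only."""
    seen = set()    # membership test is constant time on average
    ordered = []    # keeps first-seen order so it can be written out once at the end
    for line in lines:
        line = line.rstrip("\n")            # drop the newline before slicing
        key = line[:width] + line[-width:]
        if key not in seen:
            seen.add(key)
            ordered.append(key)
    return ordered


# Hypothetical sample data standing in for lines read from the input file.
sample = ["AAAAAACCCCCCGGGGGG\n", "AAAAAATTTTTTGGGGGG\n", "CCCCCCAAAAAATTTTTT\n"]
result = unique_keys(sample)
```

With real files you would pass the open input file (or a slice of it) as `lines` and then write `"\n".join(result)` to the output in a single call. Whether one big write actually beats many small ones depends on buffering, so it is worth measuring.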

Óscar López answered Oct 08 '22 11:10


You can use itertools.islice (https://docs.python.org/2/library/itertools.html#itertools.islice) to step through the file four lines at a time:

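import itertools

def method():
    with open(input_file, 'r') as inf, open(output_file, 'w') as ouf:
        seen = set()
        # Take every fourth line; a start of None begins at line one
        # (use 1 instead to begin at line two).
        for line in itertools.islice(inf, None, None, 4):
            line = line.rstrip('\n')  # otherwise line[-6:] includes the newline
            s = line[:6]+line[-6:]
            if s not in seen:
                seen.add(s)
                ouf.write("{}\n".format(s))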
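
To see which lines the start argument of islice selects, here is a small sketch using an in-memory file. The four-line records and the name `fake_file` are invented for illustration and only stand in for the real input.

```python
import io
from itertools import islice

# Hypothetical four-line records, standing in for the real input file.
fake_file = io.StringIO(
    "header1\nSEQ1\nplus1\nqual1\n"
    "header2\nSEQ2\nplus2\nqual2\n"
)

# start=0 (same as None) yields lines 1, 5, 9, ...
from_line_one = [line.rstrip("\n") for line in islice(fake_file, 0, None, 4)]

fake_file.seek(0)  # rewind before iterating a second time

# start=1 yields lines 2, 6, 10, ...
from_line_two = [line.rstrip("\n") for line in islice(fake_file, 1, None, 4)]
```

So if the strings you want sit on line two of each group of four, a start of 1 is probably what you need.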

dting answered Oct 08 '22 11:10