I'm pretty new to Python, and I have written a (probably very ugly) script that is supposed to randomly select a subset of sequences from a fastq file. A fastq file stores information in blocks of four lines each, and the first line in each block starts with the character "@". The fastq file I use as my input is 36 GB and contains about 14,000,000 lines.
I tried to rewrite an already existing script that used way too much memory, and I managed to reduce the memory usage a lot. But the script takes forever to run, and I don't see why.
import argparse
import random
import subprocess
import sys

parser = argparse.ArgumentParser()
parser.add_argument("infile", type = str, help = "The name of the fastq input file.", default = sys.stdin)
parser.add_argument("outputfile", type = str, help = "Name of the output file.")
parser.add_argument("-n", help="Number of sequences to sample", default=1)
args = parser.parse_args()
def sample():
    linesamples = []
    infile = open(args.infile, 'r')
    outputfile = open(args.outputfile, 'w')
    # count the number of fastq "chunks" in the input file:
    seqs = subprocess.check_output(["grep", "-c", "@", str(args.infile)])
    # randomly select n fastq "chunks":
    seqsamples = random.sample(xrange(0, int(seqs)), int(args.n))
    # make a list of the lines that are to be fetched from the fastq file:
    for i in seqsamples:
        linesamples.append(int(4*i+0))
        linesamples.append(int(4*i+1))
        linesamples.append(int(4*i+2))
        linesamples.append(int(4*i+3))
    # fetch lines from input file and write them to output file.
    for i, line in enumerate(infile):
        if i in linesamples:
            outputfile.write(line)
The grep step takes practically no time at all, but after more than 500 minutes the script still hasn't started writing to the output file. So I suppose it is one of the steps between grep and the last for-loop that takes so long, but I don't understand which step exactly, or what I can do to speed it up.
In general, Python code is slowed down by the compilation and interpretation that happen at runtime. Compare this to a statically typed, compiled language, which runs just the CPU instructions once compiled. It is actually possible to extend Python with compiled modules written in C.
Python is generally limited to a single core when interpreting code, but the multiprocessing module lets you take advantage of more than one. For heavily CPU-bound problems, dividing the work across several processes can really speed things up.
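As a minimal sketch (not taken from your script; heavy_work here is a made-up stand-in for any CPU-bound function), a multiprocessing.Pool can spread work over several cores like this:

import multiprocessing

def heavy_work(n):
    # stand-in for some CPU-bound computation
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    pool = multiprocessing.Pool(processes=4)      # four worker processes
    results = pool.map(heavy_work, [10**6] * 8)   # spread eight jobs over the pool
    pool.close()
    pool.join()
    print(results)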
Depending on the size of linesamples, the check if i in linesamples will take a long time, since you are searching through a list on every iteration over infile. You could convert it to a set to improve the lookup time. Also, enumerate is not strictly needed here; I have replaced it with a line_num counter that is incremented on each iteration.
def sample():
    linesamples = set()
    infile = open(args.infile, 'r')
    outputfile = open(args.outputfile, 'w')
    # count the number of fastq "chunks" in the input file:
    seqs = subprocess.check_output(["grep", "-c", "@", str(args.infile)])
    # randomly select n fastq "chunks":
    seqsamples = random.sample(xrange(0, int(seqs)), int(args.n))
    # make a set of the lines that are to be fetched from the fastq file:
    for i in seqsamples:
        linesamples.add(int(4*i+0))
        linesamples.add(int(4*i+1))
        linesamples.add(int(4*i+2))
        linesamples.add(int(4*i+3))
    # fetch lines from input file and write them to output file.
    line_num = 0
    for line in infile:
        if line_num in linesamples:
            outputfile.write(line)
        line_num += 1
    infile.close()
    outputfile.close()
You said that grep finishes running quite quickly, so in that case, instead of just using grep to count the occurrences of @, have grep output the byte offset of each matching line (using grep's -b option). Then use random.sample to pick whichever blocks you want, and for each chosen byte offset use infile.seek to jump there and print out the four lines from that point.
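A rough sketch of that idea (written for Python 2 like the code above, assuming GNU grep and that a line beginning with "@" only ever marks a record header; the function name sample_by_offset is made up for illustration):

import random
import subprocess

def sample_by_offset(infile_name, outfile_name, n):
    # grep -b prints "byte_offset:line" for every line that starts with "@"
    out = subprocess.check_output(["grep", "-b", "^@", infile_name])
    offsets = [int(entry.split(":", 1)[0]) for entry in out.splitlines() if entry]
    # pick n record start offsets at random, then visit them in file order
    chosen = sorted(random.sample(offsets, int(n)))
    infile = open(infile_name, 'r')
    outputfile = open(outfile_name, 'w')
    for offset in chosen:
        infile.seek(offset)          # jump straight to the record header
        for _ in range(4):           # a fastq record is four lines long
            outputfile.write(infile.readline())
    infile.close()
    outputfile.close()

Sorting the chosen offsets keeps the reads sequential, which is friendlier to the disk than seeking back and forth at random.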