I have been struggling with a Python problem for a few days. I am a bioinformatician with no basic programming skills, and I am working with huge text files (approx. 25 GB) that I have to process.
I have to read the text file in groups of 4 lines at a time: the first 4 lines have to be read and processed, then the second group of 4 lines, and so on.
Obviously I cannot use readlines() because it would load the whole file into memory, and I need all 4 lines of each group for some string recognition.
I thought about using a for loop with range:
openfile = open(path, 'r')
for elem in range(0, len(openfile), 4):
    line1 = readline()
    line2 = readline()
    line3 = readline()
    line4 = readline()
    (process lines...)
Unfortunately this is not possible because a file object opened for reading has no length and cannot be indexed like a list or a dictionary.
Can anybody please help me loop over the file properly?
Thanks in advance
This has low memory overhead. It relies on the fact that a file object is an iterator that yields one line at a time.
def grouped(iterator, size):
    # zip() draws `size` items per step from the one shared iterator and stops at end of file.
    return zip(*[iter(iterator)] * size)
Use it like this:
for line1, line2, line3, line4 in grouped(your_open_file, size=4):
    do_stuff_with_lines()
Note: this code assumes that the file does not end with a partial group.
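If you would rather be told when the file does end with a partial group, here is a minimal sketch of a stricter variant. The name grouped_strict and the in-memory demo data are made up for illustration; with a real file you would pass the open file object instead of the StringIO.

```python
import io
from itertools import islice


def grouped_strict(lines, size):
    """Yield tuples of `size` items; raise if the input ends mid-group."""
    lines = iter(lines)
    while True:
        group = tuple(islice(lines, size))
        if not group:
            return  # clean end of input
        if len(group) < size:
            raise ValueError(f"incomplete final group of {len(group)} line(s)")
        yield group


# Made-up stand-in for an open file: 8 lines, i.e. two full groups of 4.
demo = io.StringIO("l1\nl2\nl3\nl4\nl5\nl6\nl7\nl8\n")
first_and_last = []
for line1, line2, line3, line4 in grouped_strict(demo, 4):
    first_and_last.append((line1.strip(), line4.strip()))
print(first_and_last)
```

Only one group of lines is held in memory at a time, so this should behave the same way on a 25 GB file as on the tiny demo.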
You're reading a fastq file, right? You're most probably reinventing the wheel - you could just use Biopython, it has tools for dealing with common biology file formats. For instance see this tutorial, for doing something with fastq files - it looks basically like this:
from Bio import SeqIO
for record in SeqIO.parse("SRR020192.fastq", "fastq"):
    # do something with record, using record.seq, record.id etc
    pass
More on biopython SeqRecord objects here.
Here is another biopython fastq-processing tutorial, including a variant for doing this faster using a lower-level library, like this:
from Bio.SeqIO.QualityIO import FastqGeneralIterator
with open("untrimmed.fastq") as handle:  # the file is closed when the block ends
    for title, seq, qual in FastqGeneralIterator(handle):
        # do things with title, seq, qual values
        pass
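To give a feel for what the (title, seq, qual) tuples contain, here is a rough pure-Python sketch of the same idea. It is not Biopython's implementation: simple_fastq_records is a hypothetical helper that assumes every record is exactly 4 lines, and the reads in the demo are made up.

```python
import io


def simple_fastq_records(handle):
    """Yield (title, seq, qual) from FASTQ text with exactly 4 lines per record.

    Illustrative only: sequences wrapped over several lines are not handled.
    """
    while True:
        title = handle.readline()
        if not title:
            return  # end of file
        seq = handle.readline().rstrip("\n")
        plus = handle.readline()
        qual = handle.readline().rstrip("\n")
        if not title.startswith("@") or not plus.startswith("+"):
            raise ValueError("not a 4-line FASTQ record: " + title.strip())
        if len(seq) != len(qual):
            raise ValueError("sequence/quality length mismatch in " + title.strip())
        # Drop the leading "@" and the newline from the title line.
        yield title[1:].rstrip("\n"), seq, qual


# Tiny in-memory example instead of a real file.
demo = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nGGCA\n+\nFFFF\n")
hits = [title for title, seq, qual in simple_fastq_records(demo) if "GGC" in seq]
print(hits)
```

For real data the Biopython iterator is the safer choice, since it already deals with the corner cases of the format.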
There's also the HTSeq package, with more deep-sequencing-specific tools, which I actually use more often.
By the way, if you don't know about Biostar already, you could take a look - it's a StackExchange-format site specifically for bioinformatics.