
How to avoid using readlines()?

I need to deal with super large txt input files, and I usually use .readlines() to first read the whole file, and turn it into a list.

I know this uses a lot of memory and can be quite slow, but I also need list indexing to manipulate specific lines, like below:

#!/usr/bin/python

import os,sys
import glob
import commands
import gzip

path= '/home/xxx/scratch/'
fastqfiles1=glob.glob(path+'*_1.recal.fastq.gz')

for fastqfile1 in fastqfiles1:
    filename = os.path.basename(fastqfile1)
    job_id = filename.split('_')[0]
    fastqfile2 = os.path.join(path+job_id+'_2.recal.fastq.gz') 

    newfastq1 = os.path.join(path+job_id+'_1.fastq.gz') 
    newfastq2 = os.path.join(path+job_id+'_2.fastq.gz') 

    l1= gzip.open(fastqfile1,'r').readlines()
    l2= gzip.open(fastqfile2,'r').readlines()
    f1=[]
    f2=[]
    for i in range(0,len(l1)):
        if i % 4 == 3:
           b1=[ord(x) for x in l1[i]]
           ave1=sum(b1)/float(len(l1[i]))
           b2=[ord(x) for x in str(l2[i])]
           ave2=sum(b2)/float(len(l2[i]))
           if (ave1 >= 20 and ave2>= 20):
              f1.append(l1[i-3])
              f1.append(l1[i-2])
              f1.append(l1[i-1])
              f1.append(l1[i])
              f2.append(l2[i-3])
              f2.append(l2[i-2])
              f2.append(l2[i-1])
              f2.append(l2[i])
    output1=gzip.open(newfastq1,'w')
    output1.writelines(f1)
    output1.close()
    output2=gzip.open(newfastq2,'w')
    output2.writelines(f2)
    output2.close()

In general, I'm trying to check every 4th line of the file, and if that 4th line meets the desired condition, I append the whole 4-line group to the output. Can I do this without readlines()? Thanks.

EDIT: Hi, actually I myself found a better way:

import commands
l1=commands.getoutput('zcat ' + fastqfile1).splitlines(True)
l2=commands.getoutput('zcat ' + fastqfile2).splitlines(True)

I think 'zcat' is much faster: readlines() took around 15 minutes, while zcat took only about 1 minute.
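
Note that commands.getoutput still holds the whole decompressed file in memory, and the commands module was removed in Python 3. If zcat is the faster decompressor on a given system, its output can be consumed line by line instead. A hedged sketch, assuming Python 3 and a command that writes text to standard output; `stream_lines` is a hypothetical helper name, not something from the post:

```python
import subprocess


def stream_lines(cmd):
    """Yield the lines written to stdout by cmd, one at a time.

    cmd is an argument list such as ['zcat', path]; no shell is involved.
    """
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            universal_newlines=True)
    try:
        # Iterating over the pipe reads one line at a time instead of
        # collecting the whole output into a single string.
        for line in proc.stdout:
            yield line
    finally:
        proc.stdout.close()
        returncode = proc.wait()
    # Only reached when the output was read to the end.
    if returncode != 0:
        raise subprocess.CalledProcessError(returncode, cmd)
```

Something like `for line in stream_lines(['zcat', fastqfile1]):` would then replace the list built by splitlines(True), at the cost of losing random access by index.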

LookIntoEast asked May 06 '26 11:05


2 Answers

If you can refactor your code to read through the file linearly, you can simply iterate with for line in file, which goes through the file line by line without reading it all into memory at once. Since your access pattern is more complicated, you could instead use a generator to replace readlines(). One way to do this is with itertools.izip or itertools.izip_longest (Python 2; in Python 3 these are the built-in zip and itertools.zip_longest):

import gzip
import itertools

def four_at_a_time(iterable):
    """Returns an iterator that returns a 4-tuple of objects at a time from the
       given iterable"""
    # The same iterator object repeated 4 times, so izip pulls 4 consecutive
    # items into each tuple.
    args = [iter(iterable)] * 4
    return itertools.izip(*args)
...
l1 = four_at_a_time(gzip.open(fastqfile1, 'r'))
l2 = four_at_a_time(gzip.open(fastqfile2, 'r'))
for i, x in enumerate(itertools.izip(l1, l2)):
    # x is now a 2-tuple of 4-tuples of lines (one 4-tuple of lines from the first file,
    # and one 4-tuple of lines from the second file).  Process accordingly.
    rec1, rec2 = x
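
For Python 3, where the built-in zip is already lazy, the same grouping idea could be applied to the filter from the question roughly as follows. This is a minimal sketch, assuming both files contain the same number of complete 4-line records and that the mean-of-ord test from the question is the wanted condition; `mean_ord`, `filter_pairs`, `filter_files` and the `threshold` parameter are hypothetical names introduced for illustration:

```python
import gzip


def four_at_a_time(iterable):
    """Yield consecutive 4-tuples of items from iterable."""
    # The same iterator repeated four times: zip pulls four items per tuple.
    args = [iter(iterable)] * 4
    return zip(*args)


def mean_ord(line):
    """Mean character code of a line, computed the way the question does
    (the trailing newline is included in the average)."""
    return sum(ord(ch) for ch in line) / float(len(line))


def filter_pairs(lines1, lines2, out1, out2, threshold=20):
    """Write the records whose 4th line passes the threshold in both inputs.

    Each record is written as soon as it passes, so only one record per
    input is held in memory. Returns the number of record pairs kept.
    """
    kept = 0
    for rec1, rec2 in zip(four_at_a_time(lines1), four_at_a_time(lines2)):
        if mean_ord(rec1[3]) >= threshold and mean_ord(rec2[3]) >= threshold:
            out1.writelines(rec1)
            out2.writelines(rec2)
            kept += 1
    return kept


def filter_files(in1, in2, new1, new2, threshold=20):
    """Apply filter_pairs to gzip files opened in text mode."""
    with gzip.open(in1, 'rt') as f1, gzip.open(in2, 'rt') as f2, \
         gzip.open(new1, 'wt') as o1, gzip.open(new2, 'wt') as o2:
        return filter_pairs(f1, f2, o1, o2, threshold)
```

Because zip stops at the shorter input, a trailing incomplete record in either file is silently dropped; itertools.zip_longest would be the choice if that case needs to be detected.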
Adam Rosenfield answered May 08 '26 00:05


A simple way would be to,

(pseudocode, may contain errors, for illustrative purposes only)

    a=gzip.open()
    b=gzip.open()

    last_four_a_lines=[]
    last_four_b_lines=[]

    idx=0

    new_a=[]
    new_b=[]

    while True:
      la=a.readline()
      lb=b.readline()
      if (not la) or (not lb):
        break

      # remember the current line before testing it, so the window
      # holds the whole 4-line record when idx % 4 == 3
      last_four_a_lines.append(la)
      last_four_b_lines.append(lb)

      if idx % 4==3:
        a_calc=sum([ something ])/len(la)
        b_calc=sum([ something ])/len(lb)
        if a_calc and b_calc:
          for line in last_four_a_lines:
            new_a.append(line)
          for line in last_four_b_lines:
            new_b.append(line)
        # start a fresh window for the next record
        last_four_a_lines=[]
        last_four_b_lines=[]

      idx+=1

    a.close()
    b.close()
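
One way to make the "last four lines" idea above concrete is collections.deque with maxlen=4, which discards the oldest line automatically. A rough sketch, assuming Python 3 and the mean-of-ord test from the question; `mean_ord`, `keep_good_records` and `threshold` are hypothetical names. The result lists still grow with the number of kept records, so writing each record out as soon as it passes would save more memory:

```python
from collections import deque


def mean_ord(line):
    """Mean character code of a line, as computed in the question."""
    return sum(ord(ch) for ch in line) / float(len(line))


def keep_good_records(lines_a, lines_b, threshold=20):
    """Return the 4-line records whose 4th line passes in both inputs."""
    # A deque with maxlen=4 keeps only the four most recent lines.
    last_four_a = deque(maxlen=4)
    last_four_b = deque(maxlen=4)
    new_a, new_b = [], []
    for idx, (la, lb) in enumerate(zip(lines_a, lines_b)):
        last_four_a.append(la)
        last_four_b.append(lb)
        # Every 4th line closes a record; the windows now hold that record.
        if idx % 4 == 3:
            if mean_ord(la) >= threshold and mean_ord(lb) >= threshold:
                new_a.extend(last_four_a)
                new_b.extend(last_four_b)
    return new_a, new_b
```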
Matt Warren answered May 08 '26 00:05


