How to read two files in parallel, line by line, in python?

I've been trying to solve this issue all day without success.

I have an 'original file', let's call it 'infile', which is the file I want to edit. Additionally, I have another file that functions as a 'dictionary', let's call it 'inlist'.

Here are examples of the infile:

PRMT6   10505   Q96LA8  HMGA1   02829   NP_665906
WDR77   14387   NP_077007   SNRPE   00548   NP_003085
NCOA3   03570   NP_858045   RELA    01241   NP_068810
ITCH    07565   Q96J02  DTX1    03991   NP_004407

And the inlist:

NP_060607   Q96LA8
NP_001244066    Q96J02
NP_077007   Q9BQA1
NP_858045   Q9Y6Q9

My current approach is to split each line of both files into columns on the existing tabs. The objective is to read each line of the infile and check the following:

  1. If the element in the 3rd column of the infile is found in the 1st column of the inlist, change that element for the respective one in the inlist 2nd column
  2. If the element in the 3rd column of the infile is found in the 2nd column of the inlist, do nothing
  3. Same thing for the 6th column of the infile (index 5 in the code below)

This should retrieve the output:

PRMT6   10505   Q96LA8  HMGA1   02829   Q(...)
WDR77   14387   Q9BQA1  SNRPE   00548   Q(...)
NCOA3   03570   Q9Y6Q9  RELA    01241   Q(...)
ITCH    07565   Q96J02  DTX1    03991   Q(...)

NOTE: not all codes start with Q

I've tried using a while loop, but wasn't successful, and I'm too ashamed to post the code here (I'm new to programming, so I don't want to get demotivated so early in the 'game'). Something that would be perfect to solve this would be:

for line in inlist #, infile: <--- THIS PART! Reading both files, splitting both files, replacing both files...
        inlistcolumns = line.split('\t')
        infilecolumns = line.split('\t')
        if inlistcolumns[0] in infilecolumns[2]:
            outfile.write(str(infilecolumns[0]) + "\t" + str(infilecolumns[1]) + "\t" + str(inlistcolumns[1]) + "\t" + str(infilecolumns[3]) + "\t" + str(infilecolumns[4]) + "\t" + str(infilecolumns[5]) + "\n")
        elif inlistcolumns[0] in infilecolumns[5]:
            outfile.write(str(infilecolumns[0]) + "\t" + str(infilecolumns[1]) + "\t" + str(infilecolumns[2]) + "\t" + str(infilecolumns[3]) + "\t" + str(infilecolumns[4]) + "\t" + str(inlistcolumns[1]) + "\n")
        else:
            outfile.write('\t'.join(infilecolumns) + '\n')

Help would be much appreciated. Thanks!

Ok, after the hints of Sephallia and Jlengrand I got this:

for line in infile:
    try:
    # Read lines in the dictionary
        line2 = inlist.readline()
        inlistcolumns = line.split('\t')
        infilecolumns = line.split('\t')
        if inlistcolumns[0] in infilecolumns[2]:
            outfile.write(str(infilecolumns[0]) + "\t" + str(infilecolumns[1]) + "\t" + str(inlistcolumns[1]) + "\t" + str(infilecolumns[3]) + "\t" + str(infilecolumns[4]) + "\t" + str(infilecolumns[5]))
        elif inlistcolumns[0] in infilecolumns[5]:
                outfile.write(str(infilecolumns[0]) + "\t" + str(infilecolumns[1]) + "\t" + str(infilecolumns[2]) + "\t" + str(infilecolumns[3]) + "\t" + str(infilecolumns[4]) + "\t" + str(inlistcolumns[1]))
        else:
                    outfile.write('\t'.join(infilecolumns))
    except IndexError:
        print "End of dictionary reached. Restarting from top."

The problem is that the if statements apparently are not doing their job, as the output file is identical to the input file. What could I be doing wrong?

Edit 2:

As asked by some, here goes the full code:

import os

def replace(infilename, linename, outfilename):
    # Open original file and output file
    infile = open(infilename, 'rt')
    inlist = open(linename, 'rt')
    outfile = open(outfilename, 'wt')

    # Read lines and find those to be replaced
    for line in infile:
        infilecolumns = line.split('\t')
        line2 = inlist.readline()
        inlistcolumns = line2.split('\t')
        if inlistcolumns[0] in infilecolumns[2]:
            outfile.write(str(infilecolumns[0]) + "\t" + str(infilecolumns[1]) + "\t" + str(inlistcolumns[1]) + "\t" + str(infilecolumns[3]) + "\t" + str(infilecolumns[4]) + "\t" + str(infilecolumns[5]))
        elif inlistcolumns[0] in infilecolumns[5]:
            outfile.write(str(infilecolumns[0]) + "\t" + str(infilecolumns[1]) + "\t" + str(infilecolumns[2]) + "\t" + str(infilecolumns[3]) + "\t" + str(infilecolumns[4]) + "\t" + str(inlistcolumns[1]))
        outfile.write('\t'.join(infilecolumns))

    # Close files
    infile.close()
    inlist.close()
    outfile.close()


if __name__ == '__main__':
    wdir = os.getcwd()
    outdir = os.path.join(wdir, 'results.txt')
    outname = os.path.basename(outdir)
    original = raw_input("Type the name of the file to be parsed\n")
    inputlist = raw_input("Type the name of the library to be used\n")
    linesdir = os.path.join(wdir, inputlist)
    linesname = os.path.basename(linesdir)
    indir = os.path.join(wdir, original)
    inname = os.path.basename(indir)

    replace(indir, linesdir, outdir)

    print "Successfully applied changes.\nOriginal: %s\nLibrary: %s\nOutput:%s" % (inname, linesname, outname)

The first file to be used is hprdtotal.txt: https://www.dropbox.com/s/hohvlcdqvziewte/hprdmap.txt And the second is hprdmap.txt: https://www.dropbox.com/s/9hd0e3a8rt95pao/hprdtotal.txt

Hope this helps.

Edward Coelho asked Jan 27 '26 08:01


2 Answers

Wouldn't something like this simply work?

(following your snippet)

for line in infile:  # read file 1 one line after the other
    line2 = inlist.readline()  # read a line of file 2 ('' once it is exhausted)
    if not line2:
        # readline() does not raise at end of file, so test for '' instead
        print("End of file 2 reached")
        outfile.write(line)  # nothing left to compare with; copy the line as is
        continue
    inlistcolumns = line2.rstrip('\n').split('\t')  # split line2, not line
    infilecolumns = line.rstrip('\n').split('\t')
    if inlistcolumns[0] in infilecolumns[2]:
        infilecolumns[2] = inlistcolumns[1]  # replace the 3rd column
    elif inlistcolumns[0] in infilecolumns[5]:
        infilecolumns[5] = inlistcolumns[1]  # replace the 6th column
    outfile.write('\t'.join(infilecolumns) + '\n')

I really don't see why you wouldn't load one file into memory first, though, and then do a simple pattern search. Is there a proper reason for you to read both files at the same time? (Does line 45 of file 1 match line 45 of file 2?)
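
If the two files really did line up row for row, reading them in parallel could be sketched with zip, which pairs the Nth line of one file with the Nth line of the other and stops at the shorter file. This is only a rough Python 3 sketch: the helper name paired_lines is hypothetical, and the files are simulated with in-memory strings built from the sample rows in the question. It also illustrates why same-position pairing does not seem to fit that data, since the matching ID sits on a different line of inlist.

```python
import io


def paired_lines(file_a, file_b):
    """Yield (columns_a, columns_b) for lines at the same position in both files."""
    # zip stops as soon as the shorter file is exhausted.
    for line_a, line_b in zip(file_a, file_b):
        yield line_a.rstrip("\n").split("\t"), line_b.rstrip("\n").split("\t")


# Hypothetical in-memory stand-ins for the two files from the question.
infile = io.StringIO(
    "WDR77\t14387\tNP_077007\tSNRPE\t00548\tNP_003085\n"
    "NCOA3\t03570\tNP_858045\tRELA\t01241\tNP_068810\n"
)
inlist = io.StringIO(
    "NP_060607\tQ96LA8\n"
    "NP_077007\tQ9BQA1\n"
)

pairs = list(paired_lines(infile, inlist))
# Only same-position lines are compared, so NP_077007 (line 1 of infile)
# never meets its entry on line 2 of inlist.
matches = [a for a, b in pairs if a[2] == b[0]]
print(len(pairs), len(matches))
```

With real files, the same loop works on two open file objects; on Python 2, itertools.izip avoids building the whole list of pairs in memory.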

jlengrand answered Jan 29 '26 20:01


I would suggest loading inlist into memory as a lookup table (a dict in Python), then looping over infile and using the lookup table to decide whether to replace.

I'm not 100% sure I've got your logic correct here, but it's a base you can build on.

import csv

# Build the lookup table from inlist: column 1 -> column 2
lookup = {}
uniq2nd = set()
with open('inlist') as f:
    tabin = csv.reader(f, delimiter='\t')
    for c1, c2 in tabin:
        lookup[c1] = c2
        uniq2nd.add(c2)

# 'wb' is what the csv module expects on Python 2;
# on Python 3 use open('outfile', 'w', newline='') instead
with open('infile') as f, open('outfile', 'wb') as fout:
    tabin = csv.reader(f, delimiter='\t')
    tabout = csv.writer(fout, delimiter='\t')
    for row in tabin:
        if row[2] not in uniq2nd: # do nothing if already in col 2 of inlist
            row[2] = lookup.get(row[2], row[2]) # replace or keep same
        # etc...
        tabout.writerow(row)
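
For reference, here is a rough, self-contained Python 3 sketch of the same lookup-table idea applied to both the 3rd and 6th columns. The function names load_lookup and replace_ids are hypothetical, the files are simulated with in-memory strings taken from the sample rows in the question, and leaving IDs that have no entry in inlist unchanged is my assumption about the intended behaviour.

```python
import csv
import io


def load_lookup(inlist):
    """Map each ID in column 1 of inlist to its replacement in column 2."""
    lookup = {}
    for row in csv.reader(inlist, delimiter="\t"):
        if len(row) >= 2:  # skip blank or malformed lines
            lookup[row[0]] = row[1]
    return lookup


def replace_ids(infile, outfile, lookup, columns=(2, 5)):
    """Copy infile to outfile, swapping IDs in the given 0-based columns."""
    already_converted = set(lookup.values())  # IDs from column 2 of inlist
    writer = csv.writer(outfile, delimiter="\t", lineterminator="\n")
    for row in csv.reader(infile, delimiter="\t"):
        for i in columns:
            if i < len(row) and row[i] not in already_converted:
                # IDs without an entry in the lookup stay as they are.
                row[i] = lookup.get(row[i], row[i])
        writer.writerow(row)


# In-memory stand-ins for the sample files in the question.
inlist = io.StringIO("NP_060607\tQ96LA8\nNP_077007\tQ9BQA1\nNP_858045\tQ9Y6Q9\n")
infile = io.StringIO(
    "PRMT6\t10505\tQ96LA8\tHMGA1\t02829\tNP_665906\n"
    "WDR77\t14387\tNP_077007\tSNRPE\t00548\tNP_003085\n"
)
outfile = io.StringIO()

replace_ids(infile, outfile, load_lookup(inlist))
result = outfile.getvalue()
print(result)
```

With real files, the csv documentation recommends opening them with newline='' on Python 3, for example open('infile', newline='') and open('outfile', 'w', newline='').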

Jon Clements answered Jan 29 '26 20:01