I've been trying to solve this issue all day without success.
I have an 'original file', let's call it 'infile', which is the file I want to edit. Additionally, I have another file that functions as a 'dictionary', let's call it 'inlist'.
Here are examples of the infile:
PRMT6 10505 Q96LA8 HMGA1 02829 NP_665906
WDR77 14387 NP_077007 SNRPE 00548 NP_003085
NCOA3 03570 NP_858045 RELA 01241 NP_068810
ITCH 07565 Q96J02 DTX1 03991 NP_004407
And the inlist:
NP_060607 Q96LA8
NP_001244066 Q96J02
NP_077007 Q9BQA1
NP_858045 Q9Y6Q9
My current approach is to split each line into its columns, using the existing tabs as separators. The objective is to read each line of the infile and check some things:
This should produce the output:
PRMT6 10505 Q96LA8 HMGA1 02829 Q(...)
WDR77 14387 Q9BQA1 SNRPE 00548 Q(...)
NCOA3 03570 Q9Y6Q9 RELA 01241 Q(...)
ITCH 07565 Q96J02 DTX1 03991 Q(...)
NOTE: not all codes start with Q
I've tried using a while loop, but wasn't successful and I'm too ashamed to post the code here (I'm new to programming, so I don't want to get demotivated so early in the 'game'). Something that would be perfect to solve this would be:
for line in inlist #, infile: <--- THIS PART! Reading both files, splitting both files, replacing both files...
    inlistcolumns = line.split('\t')
    infilecolumns = line.split('\t')
    if inlistcolumns[0] in infilecolumns[2]:
        outfile.write(str(infilecolumns[0]) + "\t" + str(infilecolumns[1]) + "\t" + str(inlistcolumns[1]) + "\t" + str(infilecolumns[3]) + "\t" + str(infilecolumns[4]) + "\t" + str(infilecolumns[5]) + "\n")
    elif inlistcolumns[0] in infilecolumns[5]:
        outfile.write(str(infilecolumns[0]) + "\t" + str(infilecolumns[1]) + "\t" + str(infilecolumns[2]) + "\t" + str(infilecolumns[3]) + "\t" + str(infilecolumns[4]) + "\t" + str(inlistcolumns[1]) + "\n")
    else:
        outfile.write('\t'.join(infilecolumns) + '\n')
Help would be much appreciated. Thanks!
Ok, after the hints of Sephallia and Jlengrand I got this:
for line in infile:
    try:
        # Read lines in the dictionary
        line2 = inlist.readline()
        inlistcolumns = line.split('\t')
        infilecolumns = line.split('\t')
        if inlistcolumns[0] in infilecolumns[2]:
            outfile.write(str(infilecolumns[0]) + "\t" + str(infilecolumns[1]) + "\t" + str(inlistcolumns[1]) + "\t" + str(infilecolumns[3]) + "\t" + str(infilecolumns[4]) + "\t" + str(infilecolumns[5]))
        elif inlistcolumns[0] in infilecolumns[5]:
            outfile.write(str(infilecolumns[0]) + "\t" + str(infilecolumns[1]) + "\t" + str(infilecolumns[2]) + "\t" + str(infilecolumns[3]) + "\t" + str(infilecolumns[4]) + "\t" + str(inlistcolumns[1]))
        else:
            outfile.write('\t'.join(infilecolumns))
    except IndexError:
        print "End of dictionary reached. Restarting from top."
The problem is that apparently the if statements are not doing their job, as the output file remained equal to the input file. What can I be doing wrong?
Edit 2:
As asked by some, here goes the full code:
import os

def replace(infilename, linename, outfilename):
    # Open original file and output file
    infile = open(infilename, 'rt')
    inlist = open(linename, 'rt')
    outfile = open(outfilename, 'wt')
    # Read lines and find those to be replaced
    for line in infile:
        infilecolumns = line.split('\t')
        line2 = inlist.readline()
        inlistcolumns = line2.split('\t')
        if inlistcolumns[0] in infilecolumns[2]:
            outfile.write(str(infilecolumns[0]) + "\t" + str(infilecolumns[1]) + "\t" + str(inlistcolumns[1]) + "\t" + str(infilecolumns[3]) + "\t" + str(infilecolumns[4]) + "\t" + str(infilecolumns[5]))
        elif inlistcolumns[0] in infilecolumns[5]:
            outfile.write(str(infilecolumns[0]) + "\t" + str(infilecolumns[1]) + "\t" + str(infilecolumns[2]) + "\t" + str(infilecolumns[3]) + "\t" + str(infilecolumns[4]) + "\t" + str(inlistcolumns[1]))
        outfile.write('\t'.join(infilecolumns))
    # Close files
    infile.close()
    inlist.close()
    outfile.close()

if __name__ == '__main__':
    wdir = os.getcwd()
    outdir = os.path.join(wdir, 'results.txt')
    outname = os.path.basename(outdir)
    original = raw_input("Type the name of the file to be parsed\n")
    inputlist = raw_input("Type the name of the library to be used\n")
    linesdir = os.path.join(wdir, inputlist)
    linesname = os.path.basename(linesdir)
    indir = os.path.join(wdir, original)
    inname = os.path.basename(indir)
    replace(indir, linesdir, outdir)
    print "Successfully applied changes.\nOriginal: %s\nLibrary: %s\nOutput:%s" % (inname, linesname, outname)
The first file to be used is hprdtotal.txt: https://www.dropbox.com/s/hohvlcdqvziewte/hprdmap.txt And the second is hprdmap.txt: https://www.dropbox.com/s/9hd0e3a8rt95pao/hprdtotal.txt
Hope this helps.
Wouldn't something like this simply work?
(following your snippet)
for line in infile:  # read file 1 one line after the other
    infilecolumns = line.rstrip('\n').split('\t')
    line2 = inlist.readline()  # read a line of file 2; readline() returns '' at end of file
    if not line2:
        print "End of file 2 reached"
        outfile.write('\t'.join(infilecolumns) + '\n')  # nothing left to compare, keep the line as is
        continue
    inlistcolumns = line2.rstrip('\n').split('\t')  # split line2, not line
    if inlistcolumns[0] in infilecolumns[2]:
        outfile.write(str(infilecolumns[0]) + "\t" + str(infilecolumns[1]) + "\t" + str(inlistcolumns[1]) + "\t" + str(infilecolumns[3]) + "\t" + str(infilecolumns[4]) + "\t" + str(infilecolumns[5]) + "\n")
    elif inlistcolumns[0] in infilecolumns[5]:
        outfile.write(str(infilecolumns[0]) + "\t" + str(infilecolumns[1]) + "\t" + str(infilecolumns[2]) + "\t" + str(infilecolumns[3]) + "\t" + str(infilecolumns[4]) + "\t" + str(inlistcolumns[1]) + "\n")
    else:
        outfile.write('\t'.join(infilecolumns) + '\n')
I really don't get why you don't load the file into memory first, though, and then do a simple pattern search. Is there a proper reason for you to read both files at the same time? (Does line 45 of file 1 match line 45 of file 2?)
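To make the point concrete, here is a small sketch (Python 3, with the sample rows from the question typed in as lists; the variable names are just for illustration). It compares reading both files in lockstep against checking every infile code against all inlist keys held in memory. It only looks for matches and does not do the replacement.

```python
# Sample rows copied from the question (already split into columns).
infile_rows = [
    ["PRMT6", "10505", "Q96LA8", "HMGA1", "02829", "NP_665906"],
    ["WDR77", "14387", "NP_077007", "SNRPE", "00548", "NP_003085"],
    ["NCOA3", "03570", "NP_858045", "RELA", "01241", "NP_068810"],
    ["ITCH", "07565", "Q96J02", "DTX1", "03991", "NP_004407"],
]
inlist_rows = [
    ["NP_060607", "Q96LA8"],
    ["NP_001244066", "Q96J02"],
    ["NP_077007", "Q9BQA1"],
    ["NP_858045", "Q9Y6Q9"],
]

# Lockstep: row i of infile is only ever compared with row i of inlist.
lockstep_hits = [
    key
    for row, (key, _) in zip(infile_rows, inlist_rows)
    if key in (row[2], row[5])
]

# In memory: every code in columns 3 and 6 is checked against all inlist keys.
keys = {key for key, _ in inlist_rows}
memory_hits = [
    code
    for row in infile_rows
    for code in (row[2], row[5])
    if code in keys
]

print(lockstep_hits)  # []
print(memory_hits)    # ['NP_077007', 'NP_858045']
```

With this sample the lockstep version finds nothing, because the matching keys sit on different line numbers in the two files.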
I would suggest loading inlist into memory as a lookup table (which is a dict in Python), then looping over infile and using the lookup table to decide whether to replace.
I'm not 100% sure I've got your logic correct here, but it's a base you can build on.
import csv

lookup = {}
uniq2nd = set()
with open('inlist') as f:
    tabin = csv.reader(f, delimiter='\t')
    for c1, c2 in tabin:
        lookup[c1] = c2
        uniq2nd.add(c2)

# 'wb' is what the Python 2 csv module expects; on Python 3 use open('outfile', 'w', newline='')
with open('infile') as f, open('outfile', 'wb') as fout:
    tabin = csv.reader(f, delimiter='\t')
    tabout = csv.writer(fout, delimiter='\t')
    for row in tabin:
        if row[2] not in uniq2nd:  # do nothing if it is already a col2 value of inlist
            row[2] = lookup.get(row[2], row[2])  # replace or keep same
        # etc...
        tabout.writerow(row)
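For anyone on Python 3, the same lookup-table idea can be sketched roughly as below. This is only a sketch under a few assumptions: the function name `replace_codes` and its `columns` parameter are made up for the example, both files are taken to be tab-separated, and the codes to replace are assumed to live in the third and sixth columns as in the question. The sample data is fed in through `io.StringIO` so the snippet runs without any files on disk.

```python
import csv
import io


def replace_codes(infile, inlist, outfile, columns=(2, 5)):
    """Rewrite infile to outfile, swapping codes found in the inlist lookup.

    `columns` holds the zero-based column indexes to check (an assumption
    based on the sample data: third and sixth columns).
    """
    # Build the lookup table: first column of inlist -> second column.
    lookup = {}
    for row in csv.reader(inlist, delimiter="\t"):
        if len(row) >= 2:
            lookup[row[0]] = row[1]

    writer = csv.writer(outfile, delimiter="\t", lineterminator="\n")
    for row in csv.reader(infile, delimiter="\t"):
        for i in columns:
            if i < len(row):
                # Replace when the code is a known key, otherwise keep it.
                row[i] = lookup.get(row[i], row[i])
        writer.writerow(row)


# Sample data from the question, joined with tabs.
infile = io.StringIO(
    "PRMT6\t10505\tQ96LA8\tHMGA1\t02829\tNP_665906\n"
    "WDR77\t14387\tNP_077007\tSNRPE\t00548\tNP_003085\n"
    "NCOA3\t03570\tNP_858045\tRELA\t01241\tNP_068810\n"
    "ITCH\t07565\tQ96J02\tDTX1\t03991\tNP_004407\n"
)
inlist = io.StringIO(
    "NP_060607\tQ96LA8\n"
    "NP_001244066\tQ96J02\n"
    "NP_077007\tQ9BQA1\n"
    "NP_858045\tQ9Y6Q9\n"
)
out = io.StringIO()
replace_codes(infile, inlist, out)
result = out.getvalue()
print(result)
```

On the four sample lines this changes the third column of the WDR77 and NCOA3 rows to Q9BQA1 and Q9Y6Q9. The sixth column stays as it is here, only because the short inlist sample has no entries for those codes. With real files you would pass objects from `open(..., newline='')` instead of the `StringIO` ones.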