I need to combine lines from two files, based on the condition that part of a line in one file appears in a line of the other file.
A part of the first file:
12319000 -64,7357668067227 -0,1111052148685535
12319000 -79,68527661064425 -0,13231739777754026
12319000 -94,69642857142858 -0,15117839559513543
12319000 -109,59301470588237 -0,18277783185642743
12319001 99,70264355742297 0,48329515727315125
12319001 84,61113445378152 0,4060446341409862
12319001 69,7032037815126 0,29803063228455073
12319001 54,93886554621849 0,20958105041136763
12319001 39,937394957983194 0,13623056582981297
12319001 25,05574229691877 0,07748669438398018
12319001 9,99716386554622 0,028110643107892755
A part of the second file:
12319000.abf mutant 1
12319001.abf mutant 2
12319002.abf mutant 3
I need to create a file in which each line consists of the whole line from the first file followed by everything from the matching line of the second file, except for the filename in the first column.
As you can see, more than one line in the first file corresponds to each line in the second one. I need the operation to be done for every line, so the output should look like this:
12319000 -94,69642857142858 -0,15117839559513543 mutant 1
12319000 -109,59301470588237 -0,18277783185642743 mutant 1
12319001 99,70264355742297 0,48329515727315125 mutant 2
12319001 84,61113445378152 0,4060446341409862 mutant 2
I've written this code:
oocytes = open(file_with_oocytes, 'r')
results = open(os.path.join(path, 'results.csv'), 'r')
results_new = open(os.path.join(path, 'results_with_oocytes.csv'), 'w')
for line in results:
    for lines in oocytes:
        if lines[0:7] in line:
            print line + lines[12:]
But it prints only this and nothing more, though there are 45 lines in the first file:
12319000 99,4952380952381 0,3011778623990699 mutant 1
12319000 99,4952380952381 0,3011778623990699 mutant 2
12319000 99,4952380952381 0,3011778623990699 mutant 3
What is wrong with the code? Or should it be done in a completely different way?
File handles in Python have state; that is, they do not work like lists. You can repeatedly iterate over a list and get all the values out each time. Files, on the other hand, have a position from which the next read() will occur. When you iterate over the file, you read() each line. When you reach the last line, the file pointer is at the end of the file. A read() from the end of the file returns the empty string ''!
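The behaviour described above can be seen in a small sketch. It uses io.StringIO as a stand-in for a real file (an assumption made only so the example is self-contained), and it is written so that it runs on Python 3; the two sample lines are taken from the second file in the question.

```python
import io

# io.StringIO stands in for an open file; it keeps a read position
# in the same way a real file object does.
oocytes = io.StringIO(u"12319000.abf mutant 1\n12319001.abf mutant 2\n")

first_pass = [line for line in oocytes]   # consumes every line
second_pass = [line for line in oocytes]  # position is at the end: nothing left

print(len(first_pass))    # 2
print(len(second_pass))   # 0
print(repr(oocytes.read()))  # '' because we are still at the end

oocytes.seek(0)           # rewind to the beginning
third_pass = [line for line in oocytes]
print(len(third_pass))    # 2
```

Rewinding with seek(0) before each inner loop would make the nested loops in the question run to completion, but it re-reads the second file once per line of the first.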
What you need to do is read in the oocytes file once at the beginning and store the values, maybe something like this:
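oodict = {}
for line in oocytes:
    # key: the 8-digit ID before '.abf'; value: the rest of the line
    oodict[line[0:8]] = line[12:].rstrip('\n')
for line in results:
    results_key = line[0:8]
    if results_key in oodict:
        # results line first, then the matching oocyte fields
        print(line.rstrip('\n') + oodict[results_key])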
Note that this solution doesn't rely on the lengths of any field except for the length of the file extension in the second file.
# make a dict keyed on the filename before the extension
# with the other two fields as its value
file2dict = dict((row[0][:-4], row[1:])
                 for row in (line.split() for line in file2))
# then append to each row the values
# that match its first column
output = [row + file2dict[row[0]] for row in (line.split() for line in file1)]
For testing purposes only, I used:
# I just use this to emulate a file object, as iterating over it yields lines
# just use file1 = open(whatever_the_filename_is_for_this_data)
# and the rest of the program is the same
file1 = """12319000 -64,7357668067227 -0,1111052148685535
12319000 -79,68527661064425 -0,13231739777754026
12319000 -94,69642857142858 -0,15117839559513543
12319000 -109,59301470588237 -0,18277783185642743
12319001 99,70264355742297 0,48329515727315125
12319001 84,61113445378152 0,4060446341409862
12319001 69,7032037815126 0,29803063228455073
12319001 54,93886554621849 0,20958105041136763
12319001 39,937394957983194 0,13623056582981297
12319001 25,05574229691877 0,07748669438398018
12319001 9,99716386554622 0,028110643107892755""".splitlines()
# again, use file2 = open(whatever_the_filename_is_for_this_data)
# and the rest of the program will work the same
file2 = """12319000.abf mutant 1
12319001.abf mutant 2
12319002.abf mutant 3""".splitlines()
In real use, you should just use normal file objects. The output for the test data is:
[['12319000', '-64,7357668067227', '-0,1111052148685535', 'mutant', '1'],
['12319000', '-79,68527661064425', '-0,13231739777754026', 'mutant', '1'],
['12319000', '-94,69642857142858', '-0,15117839559513543', 'mutant', '1'],
['12319000', '-109,59301470588237', '-0,18277783185642743', 'mutant', '1'],
['12319001', '99,70264355742297', '0,48329515727315125', 'mutant', '2'],
['12319001', '84,61113445378152', '0,4060446341409862', 'mutant', '2'],
['12319001', '69,7032037815126', '0,29803063228455073', 'mutant', '2'],
['12319001', '54,93886554621849', '0,20958105041136763', 'mutant', '2'],
['12319001', '39,937394957983194', '0,13623056582981297', 'mutant', '2'],
['12319001', '25,05574229691877', '0,07748669438398018', 'mutant', '2'],
['12319001', '9,99716386554622', '0,028110643107892755', 'mutant', '2']]
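Since the question asks for a file rather than a list of lists, here is a hedged sketch of how the same dict lookup could produce output lines. The helper name merge_lines, the decision to skip rows with no match, and the 99999999 row are assumptions for illustration, not part of the original answer. It is written for Python 3.

```python
def merge_lines(file1, file2):
    """Yield each file1 line with the matching file2 fields appended."""
    # key: filename without its 4-character extension; value: remaining fields
    file2dict = dict((row[0][:-4], row[1:])
                     for row in (line.split() for line in file2 if line.strip()))
    for line in file1:
        row = line.split()
        if not row:
            continue  # ignore blank lines
        extra = file2dict.get(row[0])
        if extra is None:
            continue  # assumption: silently skip rows with no match
        yield " ".join(row + extra)


file1 = ["12319000 -64,7357668067227 -0,1111052148685535\n",
         "12319001 99,70264355742297 0,48329515727315125\n",
         "99999999 1,0 2,0\n"]  # hypothetical ID with no entry in file2
file2 = ["12319000.abf mutant 1\n",
         "12319001.abf mutant 2\n",
         "12319002.abf mutant 3\n"]

merged = list(merge_lines(file1, file2))
for out_line in merged:
    print(out_line)
```

With real files, file1 and file2 would be open file objects, and each yielded string could be written to the output file followed by a newline.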
Well, simple things first: you printed the newline at the end of line; you would want to drop that with line[0:-1].
Next, "lines[0:7]" only tests the first 7 characters of the line; you wanted to test 8 characters. That's why the same value of "line" was printed out with 3 different mutant values.
Finally, you would need to close and re-open oocytes for each line in results. Failing to do so ended your output after the first line of results.
Actually, the other answer is better: don't open and close oocytes for each line of results. Open it and read it in (to a list) once, then iterate over that list for each line of results.
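A minimal sketch of that list-based variant, with the newline and 8-character fixes applied, might look like this. io.StringIO again stands in for the real files (an assumption for self-containment), the variable names are illustrative, and the code targets Python 3.

```python
import io

# Stand-ins for the two open files from the question.
oocytes = io.StringIO(u"12319000.abf mutant 1\n"
                      u"12319001.abf mutant 2\n"
                      u"12319002.abf mutant 3\n")
results = io.StringIO(u"12319000 -64,7357668067227 -0,1111052148685535\n"
                      u"12319001 99,70264355742297 0,48329515727315125\n")

oocyte_lines = oocytes.readlines()  # read once; a list can be re-iterated

merged = []
for line in results:
    for oocyte in oocyte_lines:
        # compare the full 8-character ID, not just the first 7 characters
        if oocyte[0:8] == line[0:8]:
            # drop both trailing newlines; oocyte[12:] starts at the space
            # that follows '.abf'
            merged.append(line.rstrip("\n") + oocyte[12:].rstrip("\n"))

for out_line in merged:
    print(out_line)
```

This still scans the whole list for every results line, so the dict lookup from the other answers is likely the better choice for large files.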