I have two tab-delimited files, and I need to test every row in the first file against all the rows in the other file. For instance,
file1:
row1 c1 36 345 A
row2 c3 36 9949 B
row3 c4 36 858 C
file2:
row1 c1 3455 3800
row2 c3 6784 7843
row3 c3 10564 99302
row4 c5 1405 1563
Let's say I would like to output every row of file1 whose col[3] is smaller than at least one (not every) col[2] in file2, considering only the file2 rows whose col[1] is the same.
Expected output:
row1 c1 36 345 A
row2 c3 36 9949 B
Since I am working in Ubuntu, I would like the input command to look like this: python code.py [file1] [file2] > [output]
I wrote the following code:
import sys

filename1 = sys.argv[1]
filename2 = sys.argv[2]

file1 = open(filename1, 'r')
file2 = open(filename2, 'r')

done = False

for x in file1.readlines():
    col = x.strip().split()
    for y in file2.readlines():
        col2 = y.strip().split()
        if col[1] == col2[1] and col[3] < col2[2]:
            done = True
            break
        else:
            continue
print x
However, the output looks like this:
row2 c3 36 9949 B
This is more evident with larger datasets, but basically I always get only the last row for which the condition in the nested loop was true. I suspect that "break" is breaking me out of both loops. I would like to know (1) how to break out of only one of the for loops, and (2) whether this is the only problem I've got here.
Using break in a nested loop: a break statement only stops the loop it is placed in. If the break is in the inner loop, the outer loop still continues. If the break is in the body of the outer loop, all of the looping stops.
break will only break out of the loop in which it was called. As a workaround, you can use a flag variable along with break to leave both levels of a nested loop.
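The flag approach can be sketched as follows. This is only an illustration in Python 3 syntax; the function name find_first_pair and the toy data are made up for the example and do not come from the question.

```python
def find_first_pair(rows, cols, target):
    """Return the first (r, c) with r * c == target, or None (hypothetical helper)."""
    found = None
    done = False  # flag set by the inner loop
    for r in rows:
        for c in cols:
            if r * c == target:
                found = (r, c)
                done = True
                break  # leaves only the inner loop
        if done:
            break  # the flag lets the outer loop stop as well
    return found


print(find_first_pair([1, 2, 3], [4, 5, 6], 10))  # prints (2, 5)
```

Without the "if done: break" line, the outer loop would simply move on to its next item after the inner break.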
You can avoid nested loops with itertools.product(). It yields every combination of items from multiple lists in a single loop, giving the same result as nested loops.
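A minimal sketch of the itertools.product() variant, again in Python 3 syntax with a made-up function name and toy data. Because there is only one loop, a single break ends the whole search. Note that product() reads each input completely before yielding anything, so it suits in-memory lists better than large files.

```python
from itertools import product


def find_first_pair_flat(rows, cols, target):
    """Same search as a nested loop, written as one loop (hypothetical helper)."""
    found = None
    for r, c in product(rows, cols):  # pairs come in nested-loop order
        if r * c == target:
            found = (r, c)
            break  # one loop, so one break is enough
    return found


print(find_first_pair_flat([1, 2, 3], [4, 5, 6], 10))  # prints (2, 5)
```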
break and continue apply to the innermost loop.
The issue is that you open the second file only once, and therefore it's only read once. When you execute for y in file2.readlines(): for the second time, file2.readlines() returns an empty list.
Either move file2 = open(filename2, 'r') into the outer loop, or use seek() to rewind to the beginning of file2.
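As a sketch of one way to put this together: instead of reopening or rewinding file2, the version below reads it into a list once and loops over that list. This is a different technique from the two fixes named above, chosen so the file is read a single time. It also converts the compared columns with int(), on the assumption that they always hold integers. The original code compares strings, and as strings "9949" < "10564" is False, so row2 would be missed even with the file problem fixed. The code uses Python 3 syntax, and the helper names load_rows and matching_rows are made up for this example.

```python
import sys


def load_rows(path):
    """Read a whitespace-delimited file into a list of column lists."""
    with open(path) as handle:
        return [line.split() for line in handle if line.strip()]


def matching_rows(rows1, rows2):
    """Keep rows of file1 whose col[3] is below some col[2] of file2 with equal col[1]."""
    kept = []
    for col in rows1:
        for col2 in rows2:
            # int() assumes the columns are integers; compare numbers, not strings
            if col[1] == col2[1] and int(col[3]) < int(col2[2]):
                kept.append(col)
                break  # one match is enough; go on to the next file1 row
    return kept


if __name__ == "__main__" and len(sys.argv) == 3:
    rows2 = load_rows(sys.argv[2])  # read once, reused for every file1 row
    for row in matching_rows(load_rows(sys.argv[1]), rows2):
        print("\t".join(row))
```

Run as python code.py [file1] [file2] > [output], this should give row1 and row2 for the sample data in the question.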