I have a script that takes a long time and still hasn't finished after 2 days... I parsed 1 file into 2 dictionaries, as follows:
gfftree = {'chr1':[(gene_id, gstart, gend),...], 'chr2':[(gene_id, gstart, gend),...],...}
TElocation = {'chr1':[(TE_id, TEstart, TEend),...], 'chr2':[(TE_id, TEstart, TEend),...],...}
The aim is to find every TE_id whose TEstart or TEend (or both) lies between a gene_id's gstart and gend, within each chr (key).
Here is my code:
TE_in_TSS = []
for TErange in TElocation[chromosome]:
    TE_id, TEstart, TEend = TErange
    for item in gfftree[chromosome]:
        gene, gstart, gend = item
        # builds both coordinate ranges as sets and intersects them:
        # O(interval length) work for every gene/TE pair
        if set(range(int(gstart), int(gend) + 1)) & set(range(int(TEstart), int(TEend) + 1)):
            TE_in_TSS.append((gene, TE_id, TEstart, TEend))
I'm sure this loop works fine on small data, but on something bigger, like 800,000 TE_ids and 4,000 gene_ids, it takes a very long time... and I don't know whether it will ever finish...
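As an aside, the set intersection in that condition materialises every coordinate in both intervals, so each gene/TE comparison costs time proportional to the interval lengths. For inclusive integer intervals with start <= end, the same overlap test can be done in constant time by comparing endpoints; a minimal sketch of the equivalent condition:

# Equivalent to a non-empty intersection of range(gstart, gend + 1)
# and range(TEstart, TEend + 1), but O(1) instead of O(interval length):
if int(TEstart) <= int(gend) and int(gstart) <= int(TEend):
    TE_in_TSS.append((gene, TE_id, TEstart, TEend))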
The OP's approach is O(n*m), where n is the number of genes and m is the number of TEs. Rather than test each gene against each TE as in the OP's code, this approach leverages the ordered nature of the genes and TEs, together with the specified matching rules, to visit each gene and each TE only once, apart from the gene look-ahead noted in the code below. This makes it O(n + m), provided the average gene look-ahead is small relative to n. The sequence in which each gene and TE is visited is implemented by:
def get_TE_in_TSS(genes, TEs):
    TE_in_TSS = []
    gene_pos, TE_pos = 0, 0
    gene_count, TE_count = len(genes), len(TEs)
    while gene_pos < gene_count:
        while (TE_pos < TE_count) and (TEs[TE_pos][1] <= genes[gene_pos][2]):
            match_gene_pos = gene_pos
            while (match_gene_pos < gene_count) and (TEs[TE_pos][2] >= genes[match_gene_pos][1]):
                TE_in_TSS.append((genes[match_gene_pos][0], TEs[TE_pos][0],
                                  TEs[TE_pos][1], TEs[TE_pos][2]))
                match_gene_pos += 1  # look ahead to see if this TE matches the next gene
            TE_pos += 1
        gene_pos += 1
    return TE_in_TSS
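This scan assumes that both genes and TEs are sorted by start coordinate within each chromosome; if the parsed dictionaries are not already ordered, a minimal per-chromosome preprocessing sketch (the dictionary names are from the question; the results name is mine):

# Sort each chromosome's genes and TEs by start coordinate (coercing
# coordinates to int in case the parser left them as strings) before
# running the two-pointer scan.
results = {}
for chrom in gfftree:
    genes = sorted(((g, int(s), int(e)) for g, s, e in gfftree[chrom]), key=lambda x: x[1])
    TEs = sorted(((t, int(s), int(e)) for t, s, e in TElocation.get(chrom, [])), key=lambda x: x[1])
    results[chrom] = get_TE_in_TSS(genes, TEs)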
Performance, as reported by the OP:
1 second (compared to 2+ days for the OP's code) for 801,948 TEs and 6,007 genes
Test data:
genes = (('HTR3A', 7, 9), ('ADAMTSL4', 10, 100), ('THSD4', 2000, 2800), ('PAPLN', 2850, 3000))
TEs = (('a', 10, 11), ('b', 13, 17), ('c', 50, 2500), ('d', 2550, 2700),
       ('e', 2800, 2900), ('f', 9999, 9999))

TE_in_TSS = get_TE_in_TSS(genes, TEs)
print(TE_in_TSS)
Output:
[('ADAMTSL4', 'a', 10, 11), ('ADAMTSL4', 'b', 13, 17), ('ADAMTSL4', 'c', 50, 2500),
('THSD4', 'c', 50, 2500), ('THSD4', 'd', 2550, 2700), ('THSD4', 'e', 2800, 2900),
('PAPLN', 'e', 2800, 2900)]
Note that the first 9 comments on this post refer to a more efficient O(n*m) approach that was made obsolete when the specs were clarified.
Here is a solution using multiprocessing, with the nested-loop methods kept in the code for comparison. I created two CSVs, one with 8,000 rows and one with 800 rows of randomly generated (int, float1, float2) tuples, and imported them as below:
import time
import itertools

start = time.time()

def f(TE):
    # unpack one TE record and collect all genes whose span
    # contains the TE's start or end coordinate
    TE_id, TEstart, TEend = TE
    a = []
    for gene, gstart, gend in gfftree['chr1']:
        if (gstart <= TEstart <= gend) or (gstart <= TEend <= gend):
            a.append((gene, TE_id, TEstart, TEend))
    return a

'''
#slow
TEinTSS = []
for TE_id, TEstart, TEend in TElocation['chr1']:
    for gene, gstart, gend in gfftree['chr1']:
        if (gstart <= TEstart <= gend) or (gstart <= TEend <= gend):
            TEinTSS.append((gene, TE_id, TEstart, TEend))
print(len(TEinTSS))
print(time.time() - start)

#faster
TEinTSS = []
for things in TElocation['chr1']:
    TEinTSS.extend(f(things))
print(len(TEinTSS))
print(time.time() - start)
'''

#fastest (especially with multiple cores)
from multiprocessing import Pool

if __name__ == '__main__':
    p = Pool()
    TEinTSS = list(itertools.chain.from_iterable(
        p.imap_unordered(f, TElocation['chr1'])))
    print(len(TEinTSS))
    print(time.time() - start)
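The CSV-loading step mentioned above is not shown; a minimal sketch of what it could look like (the file names and the (id, start, end) column layout are assumptions, not from the answer):

import csv

def load_intervals(path):
    # assumed layout: one id,start,end record per row
    with open(path, newline='') as fh:
        return [(row[0], float(row[1]), float(row[2])) for row in csv.reader(fh)]

gfftree = {'chr1': load_intervals('genes.csv')}    # hypothetical file name
TElocation = {'chr1': load_intervals('TEs.csv')}   # hypothetical file name

Note also that imap_unordered accepts an optional chunksize argument; batching the work (e.g. chunksize=1000) reduces inter-process communication overhead on inputs of this size.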