I am have a relatively simple problem: given a position in the genome, return the name of the gene at that point.
The way I am solving this problem right now is using the following class in cython::
class BedFile():
""" A Lookup Object """
def __init__(self, bedfile):
self.dict = {}
cdef int start, end
with open(bedfile) as infile:
for line in infile:
f = line.rstrip().split('\t')
if len(f) < 4:
continue
chr = f[0]
start = int(f[1])
end = int(f[2])
gene = f[3]
if chr not in self.dict:
self.dict[chr] = {}
self.dict[chr][gene] = (start, end)
def lookup(self, chromosome, location):
""" Lookup your gene. Returns the gene name """
cdef int l
l = int(location)
answer = ''
for k, v in self.dict[chromosome].items():
if v[0] < l < v[1]:
answer = k
break
if answer:
return answer
else:
return None
The full project is here: https://github.com/MikeDacre/python_bed_lookup, although the entire relevant class is above.
The issue with the code as is is that the resulting class/dictionary take up a very large amount of memory for the human genome, with 110 million genes (that's a 110 million line long text file). I killed the init function in the process of building the dictionary after about two minutes, when it hit 16GB of memory. Anything that uses that much memory is basically useless.
I am sure there must me a more efficient way of doing this, but I am not a C programmer, and I am very new to cython. My guess is that I could build a pure C structure of some kind to hold the gene name and the start and end values. Then I could convert lookup() into a function that calls another cdef function called _lookup(), and use that cdef function to do that actual query.
In an ideal world, the whole structure could live in memory and take up less than 2GB of memory for ~2,000,000 entries (each entry with two ints and a string).
Edit: I figured out how to do this efficiently with sqlite for large file, to see the complete code with sqlite see here: https://github.com/MikeDacre/python_bed_lookup
However, I still think that the class above can be optimized with cython to make the dictionary smaller in memory and lookups faster, any suggestions are appreciated.
Thanks!
To expand on my suggestion in the comments a bit, instead of storing (start,end) as a tuple, store it as a simple Cython-defined class:
cdef class StartEnd:
cdef public int start, end
def __init__(self, start, end):
self.start = start
self.end = end
(you could also play with changing the integer type for more size savings). I'm not recommending getting rid of the Python dictionaries because they're easy to use, and (I believe) optimised to be reasonably efficient for the (common in Python) case of string keys.
We can estimate the rough size savings by using sys.getsizeof. (Be aware that this will work well for built-in classes and Cython classes, but not so well for Python classes so don't trust it too far. Also be aware that the results are platform dependent so yours may differ slightly).
>>> sys.getsizeof((1,2)) # tuple
64
>>> sys.getsizeof(1) # Python int
28
(therefore each tuple contains 64+28+28=120 bytes)
>>> sys.getsizeof(StartEnd(1,2)) # my custom class
24
(24 makes sense: it's the PyObject_Head (16 bytes: a 64bit integer and a pointer) + 2 32-bit integers).
Therefore, 5 times smaller, which is a good start I think.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With