I am using Python's bitarray module to convert a DNA sequence, that is written in a binary file, to its reverse complement. Each nucleotide is represented by two bits in the following format: A - 00, C - 01, G - 10, T - 11.
For example, the reverse complement of AGCTACGG (00 10 01 11 00 01 10 10) would be CCGTAGCT (01 01 10 11 00 10 01 11).
This sequence takes up exactly 16 bits (2 bytes), but a sequence of length 9 would take 18 bits and it is padded to take up 24 bits (3 bytes).
At the moment I use a for cycle for the conversion, but this solution is dreadfully slow.
def reverse_complement( my_bitarray, seq_length ):
for i in range(0, 2 * seq_length - 1, 2):
if my_bitarray[i] == my_bitarray[i + 1]:
if my_bitarray[i] == 0:
my_bitarray[i], my_bitarray[i + 1] = 1, 1
else:
my_bitarray[i], my_bitarray[i + 1] = 0, 0
#padding if the bitarray is not a multiple of 8 bits in length
if seq_length / 4 != int():
my_bitarray.reverse()
my_bitarray.fill()
my_bitarray.reverse()
return my_bitarray
a = bitarray()
a.frombytes(seq[::-1])
b = a[int(seq_start)::] # seq without padding
b.reverse()
reverse_complement(b, seq_length)
Any tips on how to make this process faster?
The code you provided doesn't give the answer you indicated.
Here is code that gives the correct answer. Perhaps it will also be fast enough:
def reverse_complement(my_bitarray):
# First reverse by twos
my_bitarray = zip(my_bitarray[0::2], my_bitarray[1::2])
my_bitarray = reversed(list(my_bitarray))
my_bitarray = (i for t in my_bitarray for i in t)
my_bitarray = bitarray(my_bitarray)
# Then complement
my_bitarray.invert()
return my_bitarray
Note that you don't have to worry about the padding. bitarray.bitarray() manages all of that for you.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With