I'm implementing a motif finding algorithm from the domain of bioinformatics using Haskell. I wont go into the details of the algorithm other then to say it's branch and bound median string search. I had planned on making my implementation more interesting by implementing a concurrent approach (and later an STM approach) in order to get a multicore speed up but after compiling with the follow flags
$ ghc -prof -auto-all -O2 -fllvm -threaded -rtsopts --make main
and printing the profile I saw something interesting (and perhaps obvious):
COST CENTRE entries %time %alloc
hammingDistance 34677951 47.6 14.7
motifs 4835446 43.8 71.1
It's clear that a remarkable speedup could be gained without going anywhere near multicore programming (although that's been done and I just need to find some good test data and sort out Criterion for that).
Anyway, both of these functions are purely functional and in no way concurrent. They're also doing quite simple stuff, so I was surprised that they took so much time. Here's the code for them:
data NukeTide = A | T | C | G deriving (Read, Show, Eq, Ord, Enum)
type Motif = [NukeTide]
hammingDistance :: Motif -> Motif -> Int
hammingDistance [] [] = 0
hammingDistance xs [] = 0 -- optimistic
hammingDistance [] ys = 0 -- optimistic
hammingDistance (x:xs) (y:ys) = case (x == y) of
True -> hammingDistance xs ys
False -> 1 + hammingDistance xs ys
motifs :: Int -> [a] -> [[a]]
motifs n nukeTides = [ take n $ drop k nukeTides | k <- [0..length nukeTides - n] ]
Note that of the two arguments to hammingDistance, I can actually assume that xs is going to be x long and that ys is going to be less than or equal to that, if that opens up room for improvements.
As you can see, hammingDistance calculates the hamming distance between two motifs, which are lists of nucleotides. The motifs function takes a number and a list and returns all the sub strings of that length, e.g.:
> motifs 3 "hello world"
["hel","ell","llo","lo ","o w"," wo","wor","orl","rld"]
Since the algorithmic processes involved are so simple I can't think of a way to optimize this further. I do however have two guesses as to where I should be headed:
Does anybody have any advice on the usual procedure here? If data types are the problem, would arrays be the right answer? (I've heard they come in boxes)
Thanks for the help.
Edit: It just occurred to me that it might be useful if I describe the manner in which these two functions are called:
totalDistance :: Motif -> Int
totalDistance motif = sum $ map (minimum . map (hammingDistance motif) . motifs l) dna
This function is the result of another function, and is passed around nodes in a tree. At each node in the tree an evaluation of the nucleotide (of length <= n, that is if == n then it is a leaf node) is done, using totalDistance to score the node. From then on it's your typical branch and bound algorithm.
Edit: John asked that I print out the change I made which virutally eliminated the cost of motifs:
scoreFunction :: DNA -> Int -> (Motif -> Int)
scoreFunction dna l = totalDistance
where
-- The sum of the minimum hamming distance in each line of dna
-- is given by totalDistance motif
totalDistance motif = sum $ map (minimum . map (hammingDistance motif)) possibleMotifs
possibleMotifs = map (motifs l) dna -- Previously this was computed in the line above
I didn't make it clear in my original post, but scoreFunction is only called once, and the result is passed around in a tree traversal/branch and bound and used to evaluate nodes. Recomputing motifs at every step of the way, in retrospect, isn't one of the brightest things I've done.
Your definition of motifs
looks like it's doing a lot more traversing than necessary because each application of drop
has to traverse the list from the beginning. I would implement it using Data.List.tails
instead:
motifs2 :: Int -> [a] -> [[a]]
motifs2 n nukeTides = map (take n) $ take count $ tails nukeTides
where count = length nukeTides - n + 1
A quick comparison in GHCi shows the difference (using sum . map length
to force evaluation):
*Main> let xs = concat (replicate 10000 [A, T, C, G])
(0.06 secs, 17914912 bytes)
*Main> sum . map length $ motifs 5 xs
199980
(3.47 secs, 56561208 bytes)
*Main> sum . map length $ motifs2 5 xs
199980
(0.15 secs, 47978952 bytes)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With