Analyzing slow performance of a Haskell program

Question

I was trying to solve ITA Software's "Word Nubmers" puzzle using a brute force approach. It looks like my Haskell version is more than 10 times slower than a C#/C++ version.

The answer

Thanks to Bryan O'Sullivan's answer, I was able to "correct" my program to acceptable performance. You can read his code which is much cleaner than mine. I am going to outline the key points here.

Int is Int64 on Linux GHC x64. Unless you unsafeCoerce, you should just use Int. This saves you from having to fromIntegral. Doing Int64 on Windows 32-bit GHC is just darn slow, avoid it. (This is in fact not GHC's fault. As mentioned in my blog post below, 64 bit integers in 32-bit programs is slow in general (at least in Windows))
-fllvm or -fvia-C for performance.
Prefer quotRem to divMod, quotRem already suffices. That gave me 20% speed up.
In general, prefer Data.Vector to Data.Array as an "array"
Use the wrapper-worker pattern liberally.

The above points were enough to give me about 100% boost over my original version.

In my blog post, I have detailed a step-by-step illustrated example of how I turned the original program to match Bryan's program. There are other points mentioned there as well.

The original question

(This may sound like a "could you do the work for me" post, but I argue that such a concrete example would be very instructive since profiling Haskell performance is often seen as a myth)

(As noted in the comments, I think I have misinterpreted the problem. But who cares, we can focus on performance in a different problem)

Here's a my version of a quick recap of the problem:

A wordNumber is defined as

wordNumber 1 = "one"
wordNumber 2 = "onetwo"
wordNumber 3 = "onethree"
wordNumber 15 = "onetwothreefourfivesixseveneightnineteneleventwelvethirteenfourteenfifteen"
...

Problem: Find the 51-billion-th letter of (wordNumber Infinity); assume that letter is found at 'wordNumber x', also find 'sum [1..x]'

From an imperative perspective, a naive algorithm would be to have 2 counters, one for sum of numbers and one for sum of lengths. Keep counting the length of each wordNumber and "break" to return the result.

The imperative brute-force approach is implemented in C# here: http://ideone.com/JjCb3. It takes about 1.5 minutes to find the answer on my computer. There is also an C++ implementation that runs in 45 seconds on my computer.

Then I implemented a brute-force Haskell version: http://ideone.com/ngfFq. It cannot finish the calculation in 5 minutes on my machine. (Irony: it's has more lines than the C# version)

Here is the -p profile of the Haskell program: http://hpaste.org/49934

Question: How to make it perform comparatively to the C# version? Are there obvious mistakes I am making?

(Note: I am fully aware that brute-forcing it is not the correct solution to this problem. I am mainly interested in making the Haskell version perform comparatively to the C# version. Right now it is at least 5x slower so obviously I am missing something obvious)

(Note 2: It does not seem to be space leaking. The program runs with constant memory (about 2MB) on my computer)

(Note 3: I am compiling with `ghc -O2 WordNumber.hs)

To make the question more reader friendly, I include the "gist" of the two versions.

// C#
long sumNum = 0;
long sumLen = 0;

long target = 51000000000;
long i = 1;

for (; i < 999999999; i++)
{
    // WordiLength(1)   = 3   "one"
    // WordiLength(101) = 13  "onehundredone"
    long newLength = sumLen + WordiLength(i);
    if (newLength >= target)
        break;

    sumNum += i;
    sumLen = newLength;
}
Console.WriteLine(Wordify(i)[Convert.ToInt32(target - sumLen - 1)]);

-

-- Haskell
-- This has become totally ugly during my squeeze for 
-- performance

-- Tail recursive
-- n-th number (51000000000 in our problem) -> accumulated result -> list of 'zipped' left to try
-- accumulated has the format (sum of numbers, current lengths of the whole chain, the current number)
solve :: Int64 -> (Int64, Int64, Int64) -> [(Int64, Int64)] -> (Int64, Int64, Int64)
solve !n !acc@(!sumNum, !sumLen, !curr) ((!num, !len):xs)
    | sumLen' >= n = (sumNum', sumLen, num)
    | otherwise = solve n (sumNum', sumLen', num) xs
    where
        sumNum' = sumNum + num
        sumLen' = sumLen + len

-- wordLength 1   = 3    "one"
-- wordLength 101 = 13   "onehundredone"
wordLength :: Int64 -> Int64
-- wordLength = ...

solution :: Int64 -> (Int64, Char)
solution !x =
    let (sumNum, sumLen, n) = solve x (0,0,1) (map (
 -> (n, wordLength n)) [1..])
    in (sumNum, (wordify n) !! (fromIntegral $ x - sumLen - 1))

Bryan O'Sullivan · Accepted Answer

I've written a gist that contains both a C++ version (a copy of yours from a Haskell-cafe message, with a bug fixed) and a Haskell translation.

Notice that the two are structurally almost identical. When compiled with -fllvm, the Haskell code runs at about half the speed of the C++ code, which is pretty good.

Now let's compare my Haskell wordLength code to yours. You're passing around an extra unnecessary parameter, which is unnecessary (you apparently figured that out when writing the C++ code that I translated). Also, the large number of bang patterns suggests panic; they're almost all useless.

Your solve function is also very confused.

You're passing parameters in three different ways: a regular Int, a 3-tuple, and a list! Whoa.
This function is necessarily not very regular in its behaviour, so while you gain nothing stylistically by using a list to supply your counter, you probably force GHC to allocate memory. In other words, this both obfuscates the code and makes it slower.
By using a tuple for three parameters (for no obvious reason), you're again working hard to force GHC to allocate memory for every step through the loop, when it could avoid doing so if you passed the parameters directly.
Only your n parameter is dealt with in a sensible way, but you don't need a bang pattern on it.
The only parameter that needs a bang pattern is sumNum, because you never inspect its value until after the loop has finished. GHC's strictness analyser will deal with the others. All of your other bang patterns are unnecessary at best, misdirections at worst.

Peter Wortmann · Answer

Here are two pointers I could come up with in a quick investigation:

Note that using Int64 is really slow when you are using a 32 bit build of GHC, as is the default for Haskell Platform, currently. This also turned out to be the main villain in a previous performance problem (there I give a few more details).
For reasons I don't quite understand the divMod function does not seem to get inlined. As a result, the numbers are returned on the heap. When using div and mod separately, wordLength' executes purely on the stack as it should be.

Sadly I currently have no 64-bit GHC around to test whether this is enough to solve the problem.

Analyzing slow performance of a Haskell program

Tags:

performance

profiling

haskell

The answer

The original question

Question: How to make it perform comparatively to the C# version? Are there obvious mistakes I am making?

kizzx2

2 Answers

Bryan O'Sullivan

Peter Wortmann

Recent Activity

Donate For Us

Analyzing slow performance of a Haskell program

Tags:

performance

profiling

haskell

The answer

The original question

Question: How to make it perform comparatively to the C# version? Are there obvious mistakes I am making?

kizzx2

2 Answers

Bryan O'Sullivan

Peter Wortmann

Related questions

Recent Activity

Donate For Us