Performance problem with Euler problem and recursion on Int64 types

Question

I'm currently learning Haskell using the project Euler problems as my playground. I was astound by how slow my Haskell programs turned out to be compared to similar programs written in other languages. I'm wondering if I've forseen something, or if this is the kind of performance penalties one has to expect when using Haskell.

The following program in inspired by Problem 331, but I've changed it before posting so I don't spoil anything for other people. It computes the arc length of a discrete circle drawn on a 2^30 x 2^30 grid. It is a simple tail recursive implementation and I make sure that the updates of the accumulation variable keeping track of the arc length is strict. Yet it takes almost one and a half minute to complete (compiled with the -O flag with ghc).

import Data.Int

arcLength :: Int64->Int64
arcLength n = arcLength' 0 (n-1) 0 0 where
    arcLength' x y norm2 acc
        | x > y = acc
        | norm2 < 0 = arcLength' (x + 1) y (norm2 + 2*x +1) acc
        | norm2 > 2*(n-1) = arcLength' (x - 1) (y-1) (norm2 - 2*(x + y) + 2) acc
        | otherwise = arcLength' (x + 1) y (norm2 + 2*x + 1) $! (acc + 1)

main = print $ arcLength (2^30)

Here is a corresponding implementation in Java. It takes about 4.5 seconds to complete.

public class ArcLength {
public static void main(String args[]) {
    long n = 1 << 30;
    long x = 0;
    long y = n-1;
    long acc = 0;
    long norm2 = 0;
    long time = System.currentTimeMillis();

    while(x <= y) {
        if (norm2 < 0) {
            norm2 += 2*x + 1;
            x++;
        } else if (norm2 > 2*(n-1)) {
            norm2 += 2 - 2*(x+y);
            x--;
            y--;
        } else {
            norm2 += 2*x + 1;
            x++;
            acc++;
        }
    }

    time = System.currentTimeMillis() - time;
    System.err.println(acc);
    System.err.println(time);
}

}

EDIT: After the discussions in the comments I made som modifications in the Haskell code and did some performance tests. First I changed n to 2^29 to avoid overflows. Then I tried 6 different version: With Int64 or Int and with bangs before either norm2 or both and norm2 and acc in the declaration arcLength' x y !norm2 !acc. All are compiled with

ghc -O3 -prof -rtsopts -fforce-recomp -XBangPatterns arctest.hs

Here are the results:

(Int !norm2 !acc)
total time  =        3.00 secs   (150 ticks @ 20 ms)
total alloc =       2,892 bytes  (excludes profiling overheads)

(Int norm2 !acc)
total time  =        3.56 secs   (178 ticks @ 20 ms)
total alloc =       2,892 bytes  (excludes profiling overheads)

(Int norm2 acc)
total time  =        3.56 secs   (178 ticks @ 20 ms)
total alloc =       2,892 bytes  (excludes profiling overheads)

(Int64 norm2 acc)
arctest.exe: out of memory

(Int64 norm2 !acc)
total time  =       48.46 secs   (2423 ticks @ 20 ms)
total alloc = 26,246,173,228 bytes  (excludes profiling overheads)

(Int64 !norm2 !acc)
total time  =       31.46 secs   (1573 ticks @ 20 ms)
total alloc =       3,032 bytes  (excludes profiling overheads)

I'm using GHC 7.0.2 under a 64-bit Windows 7 (The Haskell platform binary distribution). According to the comments, the problem does not occur when compiling under other configurations. This makes me think that the Int64 type is broken in the Windows release.

Don Stewart · Accepted Answer

There's a couple of interesting things in your question.

You should be using -O2 primarily. It will just do a better job (in this case, identifying and removing laziness that was still present in the -O version).

Secondly, your Haskell isn't quite the same as the Java (it does different tests and branches). As with others, running your code on my Linux box results in around 6s runtime. It seems fine.

Make sure it is the same as the Java

One idea: let's do a literal transcription of your Java, with the same control flow, operations and types.

import Data.Bits
import Data.Int

loop :: Int -> Int
loop n = go 0 (n-1) 0 0
    where
        go :: Int -> Int -> Int -> Int -> Int
        go x y acc norm2
            | x <= y        = case () of { _
                | norm2 < 0         -> go (x+1) y     acc     (norm2 + 2*x + 1)
                | norm2 > 2 * (n-1) -> go (x-1) (y-1) acc     (norm2 + 2 - 2 * (x+y))
                | otherwise         -> go (x+1) y     (acc+1) (norm2 + 2*x + 1)
            }
            | otherwise     = acc

main = print $ loop (1 `shiftL` 30)

Peek at the core

We'll take a quick peek at the Core, using ghc-core, and it shows a very nice loop of unboxed type:

main_$s$wgo
  :: Int#
     -> Int#
     -> Int#
     -> Int#
     -> Int#

main_$s$wgo =
  \ (sc_sQa :: Int#)
    (sc1_sQb :: Int#)
    (sc2_sQc :: Int#)
    (sc3_sQd :: Int#) ->
    case <=# sc3_sQd sc2_sQc of _ {
      False -> sc1_sQb;
      True ->
        case <# sc_sQa 0 of _ {
          False ->
            case ># sc_sQa 2147483646 of _ {
              False ->
                main_$s$wgo
                  (+# (+# sc_sQa (*# 2 sc3_sQd)) 1)
                  (+# sc1_sQb 1)
                  sc2_sQc
                      (+# sc3_sQd 1);
              True ->
                main_$s$wgo
                  (-#
                     (+# sc_sQa 2)
                     (*# 2 (+# sc3_sQd sc2_sQc)))
                  sc1_sQb
                  (-# sc2_sQc 1)
                  (-# sc3_sQd 1)
            };
          True ->
            main_$s$wgo
              (+# (+# sc_sQa (*# 2 sc3_sQd)) 1)
              sc1_sQb
              sc2_sQc
              (+# sc3_sQd 1)

that is, all unboxed into registers. That loop looks great!

And performs just fine (Linux/x86-64/GHC 7.03):

./A  5.95s user 0.01s system 99% cpu 5.980 total

Checking the asm

We get reasonable assembly too, as a nice loop:

Main_mainzuzdszdwgo_info:
        cmpq    %rdi, %r8
        jg      .L8
.L3:
        testq   %r14, %r14
        movq    %r14, %rdx
        js      .L4
        cmpq    $2147483646, %r14
        jle     .L9
.L5:
        leaq    (%rdi,%r8), %r10
        addq    $2, %rdx
        leaq    -1(%rdi), %rdi
        addq    %r10, %r10
        movq    %rdx, %r14
        leaq    -1(%r8), %r8
        subq    %r10, %r14
        jmp     Main_mainzuzdszdwgo_info
.L9:
        leaq    1(%r14,%r8,2), %r14
        addq    $1, %rsi
        leaq    1(%r8), %r8
        jmp     Main_mainzuzdszdwgo_info
.L8:
        movq    %rsi, %rbx
        jmp     *0(%rbp)
.L4:
        leaq    1(%r14,%r8,2), %r14
        leaq    1(%r8), %r8
        jmp     Main_mainzuzdszdwgo_info

Using the -fvia-C backend.

So this looks fine!

My suspicion, as mentioned in the comment above, is something to do with the version of libgmp you have on 32 bit Windows generating poor code for 64 bit ints. First try upgrading to GHC 7.0.3, and then try some of the other code generator backends, then if you still have an issue with Int64, file a bug report to GHC trac.

Broadly confirming that it is indeed the cost of making those C calls in the 32 bit emulation of 64 bit ints, we can replace Int64 with Integer, which is implemented with C calls to GMP on every machine, and indeed, runtime goes from 3s to well over a minute.

Lesson: use hardware 64 bits if at all possible.

monk · Answer

Hmm, this is interesting. So I just compiled both of your programs, and tried them out:

% java -version                                                                                          
java version "1.6.0_18"
OpenJDK Runtime Environment (IcedTea6 1.8.7) (6b18-1.8.7-2~squeeze1)
OpenJDK 64-Bit Server VM (build 14.0-b16, mixed mode)
% javac ArcLength.java                                                                                   
% java ArcLength                                                                                         
843298604
6630

So about 6.6 seconds for the Java solution. Next is ghc with some optimization:

% ghc --version                                                                                          
The Glorious Glasgow Haskell Compilation System, version 6.12.1
% ghc --make -O arc.hs
% time ./arc                                                                                             
843298604
./arc  12.68s user 0.04s system 99% cpu 12.718 total

Just under 13 seconds for ghc -O

Trying with some further optimization:

% ghc --make -O3
% time ./arc                                                                                             [13:16]
843298604
./arc  5.75s user 0.00s system 99% cpu 5.754 total

With further optimization flags, the haskell solution took under 6 seconds

It would be interesting to know what version compiler you are using.

Performance problem with Euler problem and recursion on Int64 types

Tags:

performance

windows

haskell

32bit-64bit

64-bit

dbergh

2 Answers

Don Stewart

monk

Recent Activity

Donate For Us

Performance problem with Euler problem and recursion on Int64 types

Tags:

performance

windows

haskell

32bit-64bit

64-bit

dbergh

2 Answers

Don Stewart

monk

Related questions

Recent Activity

Donate For Us