Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Efficiency in Haskell when counting primes

I have the following set of functions to count the number of primes less than or equal to a number n in Haskell.

The algorithm takes a number, checks if it is divisible by two and then checks if it divisible by odd numbers up to the square root of the number being checked.

-- is a numner, n, prime? 
isPrime :: Int -> Bool
isPrime n = n > 1 &&
              foldr (\d r -> d * d > n || (n `rem` d /= 0 && r))
                True divisors

-- list of divisors for which to test primality
divisors :: [Int]
divisors = 2:[3,5..]

-- pi(n) - the prime counting function, the number of prime numbers <= n
primesNo :: Int -> Int
primesNo 2 = 1
primesNo n
    | isPrime n = 1 + primesNo (n-1)
    | otherwise = 0 + primesNo (n-1)

main = print $ primesNo (2^22)

Using GHC with the -O2 optimisation flag, counting the number of primes for n = 2^22 takes ~3.8sec on my system. The following C code take ~ 0.8 sec:

#include <stdio.h>
#include <math.h>

/*
    compile with: gcc -std=c11 -lm -O2 c_primes.c -o c_orig
*/

int isPrime(int n) {
    if (n < 2)
        return 0;
    else if (n == 2)
        return 1;
    else if (n % 2 == 0)
        return 0;
    int uL = sqrt(n);
    int i = 3;
    while (i <= uL) {
        if (n % i == 0)
            return 0;
        i+=2;
    }
    return 1;
}

int main() {
    int noPrimes = 0, limit = 4194304;
    for (int n = 0; n <= limit; n++) {
        if (isPrime(n))
            noPrimes++;
    }
    printf("Number of primes in the interval [0,%d]: %d\n", limit, noPrimes);
    return 0;
}

This algorithm take about 0.9 sec in Java and 1.8 sec in JavaScript (on Node) so it just feels that the Haskell version is slower than I would expect be. Is there anyway I can more efficiently code this in Haskell without changing the algorithm?


EDIT

The following version of isPrime offered by @dfeuer shaves one second off the running time taking it down to 2.8sec (down from 3.8). Though this is still slower than JavaScript (Node) which takes approx 1.8 sec as shown here, Yet Another Language Speed Test.

isPrime :: Int -> Bool
isPrime n
  | n <= 2 = n == 2
  | otherwise = odd n && go 3
  where
    go factor
      | factor * factor > n = True
      | otherwise = n `rem` factor /= 0 && go (factor+2) 

EDIT

In the above isPrime function, the function go calls factor * factor for each divisor for a single n. I would imagine that it would be more efficient to compare factor to the square root of n as this would only have to be calculated once per n. However, using the following code, computation time is increased by approximately 10%, is the square root of n being re-calculated every time the inequality is evaluated (for each factor)?

isPrime :: Int -> Bool
isPrime n
  | n <= 2 = n == 2
  | otherwise = odd n && go 3
  where
    go factor
      | factor > upperLim = True
      | otherwise = n `rem` factor /= 0 && go (factor+2)
      where
        upperLim = (floor.sqrt.fromIntegral) n 
like image 428
bjpelcdev Avatar asked Jan 10 '15 13:01

bjpelcdev


3 Answers

I urge you to use a different algorithm, such as the Sieve of Eratosthenes discussed in the paper by Melissa O'Neill, or the version used in Math.NumberTheory.Primes from the arithmoi package, which also offers an optimized prime counting function. However, this might get you better constant factors:

-- is a number, n, prime? 
isPrime :: Int -> Bool
isPrime n
  | n <= 2 = n == 2
  | otherwise = odd n && -- Put the 2 here instead
        foldr (\d r -> d * d > n || (n `rem` d /= 0 && r))
                True divisors

-- list of divisors for which to test primality
divisors :: [Int]
{-# INLINE divisors #-} -- No guarantee, but it might possibly inline and stay inlined,
               -- so the numbers will be generated on each call instead of
               -- being pulled in (expensively) from RAM.
divisors = [3,5..] -- No more 2:

The reason to get rid of the 2: is that an optimization called "foldr/build fusion", "short cut deforestation", or just "list fusion" can, potentially, make your divisors list go away, but, at least with GHC < 7.10.1, that 2: will block the optimization.


Edit: it seems that's not working for you, so here's something else to try:

isPrime n
  | n <= 2 = n == 2
  | otherwise = odd n && go 3
  where
    go factor
      | factor * factor > n = True
      | otherwise = n `rem` factor /= 0 && go (factor+2) 
like image 180
dfeuer Avatar answered Nov 09 '22 23:11

dfeuer


In general I've found that looping in Haskell is about 3-4 times slower than what can be accomplished with C.

To help understand the performance difference I slightly modified the programs so that a fixed number of divisor tests are made per iteration and added a parameter e to control how many iterations are made - the number of (outer) iterations performed is 2^e. For each outer iteration approx. 2^21 divisor tests are made.

The source code for each program and scripts to run and analyze the results made be found here: https://github.com/erantapaa/loopbench

Pull-requests to improve the benchmarking are welcome.

Here are the results I get on a 2.4 GHz Intel Core 2 Duo using ghc 7.8.3 (under OSX). The gcc used was "Apple LLVM version 6.0 (clang-600.0.56) (based on LLVM 3.5svn)".

e     ctime   htime  allocated  gc-bytes alloc/iter  h/c      dns
10   0.0101  0.0200      87424      3408             1.980   4.61
11   0.0151  0.0345     112000      3408             2.285   4.51
12   0.0263  0.0700     161152      3408             2.661   5.09
13   0.0472  0.1345     259456      3408             2.850   5.08
14   0.0819  0.2709     456200      3408             3.308   5.50
15   0.1575  0.5382     849416      9616             3.417   5.54
16   0.3112  1.0900    1635848     15960             3.503   5.66
17   0.6105  2.1682    3208848     15984             3.552   5.66
18   1.2167  4.3536    6354576     16032  24.24      3.578   5.70
19   2.4092  8.7336   12646032     16128  24.12      3.625   5.75
20   4.8332 17.4109   25229080     16320  24.06      3.602   5.72

e          = exponent parameter
ctime      = running time of the C program
htime      = running time of the Haskell program
allocated  = bytes allocated in the heap (Haskell program)
gc-bytes   = bytes copied during GC (Haskell program)
alloc/iter = bytes allocated in the heap / 2^e
h / c      = htime divided by ctime
dns        = (htime - ctime) divided by the number of divisor tests made
             in nanoseconds

# divisor tests made = 2^e * 2^11

Some observations:

  1. The Haskell program performs heap allocation at a rate of about 24 bytes per (outer) loop iteration. The C program clearly does not perform any alloction and runs completely in L1 cache.
  2. The gc-bytes count remains constant for e between 10 and 14 because no garbage collections were performed for those runs.
  3. The time ratio h/c gets progressively worse as more allocations are made.
  4. dps is a measure of the extra time the Haskell program takes per divisor test; it increases with the total amount of allocation made. Also there are some plateaus which suggest this is due to cache effects.

It is well known that GHC does not produce the same tight loop code that a C compiler produces. The penalty you pay is approx. 4.6 ns per iteration. Moreover, it looks like Haskell is also affected by cache effects due to heap allocation.

24 bytes per allocation and 5 ns per loop iteration is not a lot for most programs, but when you have 2^20 allocations and 2^40 loop iterations it becomes a factor.

like image 3
ErikR Avatar answered Nov 10 '22 00:11

ErikR


The C code uses 32-bit integers, while the Haskell code uses 64-bit integers.

The original C code runs in 0.63 secs on my computer. However, if I replace the int-s with long-s, it runs in 2.07 seconds with gcc and 2.17 secs with clang.

In comparison, the updated isPrime function (see it in the thread question) runs in 2.09 seconds (with -O2 and -fllvm). Note that is slightly better than the clang-compiled C code, even though they use the same LLVM code generator.

The original Haskell code runs in 3.2 secs, which I think is an acceptable overhead for the convenience of using lists for iteration.

like image 2
András Kovács Avatar answered Nov 10 '22 00:11

András Kovács