Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Optimizing numerical array performance in Haskell

I'm working on a terrain generation algorithm for a MineCraft-like world. Currently, I'm using simplex noise based on the implementation in the paper 'Simplex Noise Demystified' [PDF], since simplex noise is supposed to be faster and to have fewer artifacts than Perlin noise. This looks fairly decent (see image), but so far it's also pretty slow.

enter image description here

Running the noise function 10 times (I need noise with different wavelengths for things like terrain height, temperature, tree location, etc.) with 3 octaves of noise for each block in a chunk (16x16x128 blocks), or about 1 million calls to the noise function in total, takes about 700-800 ms. This is at least an order of magnitude too slow for the purposes of generating terrain with any decent kind of speed, despite the fact that there are no obvious expensive operations in the algorithm (at least to me). Just floor, modulo, some array lookups and basic arithmetic. The algorithm (written in Haskell) is listed below. The SCC comments are for profiling. I've omitted the 2D noise functions, since they work the same way.

g3 :: (Floating a, RealFrac a) => a g3 = 1/6  {-# INLINE int #-} int :: (Integral a, Num b) => a -> b int = fromIntegral  grad3 :: (Floating a, RealFrac a) => V.Vector (a,a,a) grad3 = V.fromList $ [(1,1,0),(-1, 1,0),(1,-1, 0),(-1,-1, 0),                      (1,0,1),(-1, 0,1),(1, 0,-1),(-1, 0,-1),                      (0,1,1),( 0,-1,1),(0, 1,-1),( 0,-1,-1)]  {-# INLINE dot3 #-} dot3 :: Num a => (a, a, a) -> a -> a -> a -> a dot3 (a,b,c) x y z = a * x + b * y + c * z  {-# INLINE fastFloor #-} fastFloor :: RealFrac a => a -> Int fastFloor x = truncate (if x > 0 then x else x - 1)  --Generate a random permutation for use in the noise functions perm :: Int -> Permutation perm seed = V.fromList . concat . replicate 2 . shuffle' [0..255] 256 $ mkStdGen seed  --Generate 3D noise between -0.5 and 0.5 simplex3D :: (Floating a, RealFrac a) => Permutation -> a -> a -> a -> a simplex3D p x y z = {-# SCC "out" #-} 16 * (n gi0 (x0,y0,z0) + n gi1 xyz1 + n gi2 xyz2 + n gi3 xyz3) where     (i,j,k) = {-# SCC "ijk" #-} (s x, s y, s z) where s a = fastFloor (a + (x + y + z) / 3)     (x0,y0,z0) = {-# SCC "x0-z0" #-} (x - int i + t, y - int j + t, z - int k + t) where t = int (i + j + k) * g3     (i1,j1,k1,i2,j2,k2) = {-# SCC "i1-k2" #-} if x0 >= y0         then if y0 >= z0 then (1,0,0,1,1,0) else              if x0 >= z0 then (1,0,0,1,0,1) else (0,0,1,1,0,1)         else if y0 <  z0 then (0,0,1,0,1,1) else              if x0 <  z0 then (0,1,0,0,1,1) else (0,1,0,1,1,0)     xyz1 = {-# SCC "xyz1" #-} (x0 - int i1 +   g3, y0 - int j1 +   g3, z0 - int k1 +   g3)     xyz2 = {-# SCC "xyz2" #-} (x0 - int i2 + 2*g3, y0 - int j2 + 2*g3, z0 - int k2 + 2*g3)     xyz3 = {-# SCC "xyz3" #-} (x0 - 1      + 3*g3, y0 - 1      + 3*g3, z0 - 1      + 3*g3)     (ii,jj,kk) = {-# SCC "iijjkk" #-} (i .&. 255, j .&. 255, k .&. 255)     gi0 = {-# SCC "gi0" #-} mod (p V.! (ii +      p V.! (jj +      p V.!  kk      ))) 12     gi1 = {-# SCC "gi1" #-} mod (p V.! (ii + i1 + p V.! (jj + j1 + p V.! (kk + k1)))) 12     gi2 = {-# SCC "gi2" #-} mod (p V.! (ii + i2 + p V.! (jj + j2 + p V.! (kk + k2)))) 12     gi3 = {-# SCC "gi3" #-} mod (p V.! (ii + 1  + p V.! (jj + 1  + p V.! (kk + 1 )))) 12     {-# INLINE n #-}     n gi (x',y',z') = {-# SCC "n" #-} (\a -> if a < 0 then 0 else         a*a*a*a*dot3 (grad3 V.! gi) x' y' z') $ 0.6 - x'*x' - y'*y' - z'*z'  harmonic :: (Num a, Fractional a) => Int -> (a -> a) -> a harmonic octaves noise = f octaves / (2 - 1 / int (2 ^ (octaves - 1))) where     f 0 = 0     f o = let r = int $ 2 ^ (o - 1) in noise r / r + f (o - 1)  --Generate harmonic 3D noise between -0.5 and 0.5 harmonicNoise3D :: (RealFrac a, Floating a) => Permutation -> Int -> a -> a -> a -> a -> a harmonicNoise3D p octaves l x y z = harmonic octaves     (\f -> simplex3D p (x * f / l) (y * f / l) (z * f / l)) 

For profiling, I used the following code,

q _ = let p = perm 0 in       sum [harmonicNoise3D p 3 l x y z :: Float | l <- [1..10], y <- [0..127], x <- [0..15], z <- [0..15]]  main = do start <- getCurrentTime           print $ q ()           end <- getCurrentTime           print $ diffUTCTime end start 

which produces the following information:

COST CENTRE                    MODULE               %time %alloc  simplex3D                      Main                  18.8   21.0 n                              Main                  18.0   19.6 out                            Main                  10.1    9.2 harmonicNoise3D                Main                   9.8    4.5 harmonic                       Main                   6.4    5.8 int                            Main                   4.0    2.9 gi3                            Main                   4.0    3.0 xyz2                           Main                   3.5    5.9 gi1                            Main                   3.4    3.4 gi0                            Main                   3.4    2.7 fastFloor                      Main                   3.2    0.6 xyz1                           Main                   2.9    5.9 ijk                            Main                   2.7    3.5 gi2                            Main                   2.7    3.3 xyz3                           Main                   2.6    4.1 iijjkk                         Main                   1.6    2.5 dot3                           Main                   1.6    0.7 

To compare, I also ported the algorithm to C#. Performance there was about 3 to 4 times faster, so I imagine I must be doing something wrong. But even then it's not nearly as fast as I would like. So my question is this: can anyone tell me if there are any ways to speed up my implementation and/or the algorithm in general or does anyone know of a different noise algorithm that has better performance characteristics but a similar look?

Update:

After following some of the suggestions offered below, the code now looks as follows:

module Noise ( Permutation, perm              , noise3D, simplex3D              ) where  import Data.Bits import qualified Data.Vector.Unboxed as UV import System.Random import System.Random.Shuffle  type Permutation = UV.Vector Int  g3 :: Double g3 = 1/6  {-# INLINE int #-} int :: Int -> Double int = fromIntegral  grad3 :: UV.Vector (Double, Double, Double) grad3 = UV.fromList $ [(1,1,0),(-1, 1,0),(1,-1, 0),(-1,-1, 0),                      (1,0,1),(-1, 0,1),(1, 0,-1),(-1, 0,-1),                      (0,1,1),( 0,-1,1),(0, 1,-1),( 0,-1,-1)]  {-# INLINE dot3 #-} dot3 :: (Double, Double, Double) -> Double -> Double -> Double -> Double dot3 (a,b,c) x y z = a * x + b * y + c * z  {-# INLINE fastFloor #-} fastFloor :: Double -> Int fastFloor x = truncate (if x > 0 then x else x - 1)  --Generate a random permutation for use in the noise functions perm :: Int -> Permutation perm seed = UV.fromList . concat . replicate 2 . shuffle' [0..255] 256 $ mkStdGen seed  --Generate 3D noise between -0.5 and 0.5 noise3D :: Permutation -> Double -> Double -> Double -> Double noise3D p x y z = 16 * (n gi0 (x0,y0,z0) + n gi1 xyz1 + n gi2 xyz2 + n gi3 xyz3) where     (i,j,k) = (s x, s y, s z) where s a = fastFloor (a + (x + y + z) / 3)     (x0,y0,z0) = (x - int i + t, y - int j + t, z - int k + t) where t = int (i + j + k) * g3     (i1,j1,k1,i2,j2,k2) = if x0 >= y0         then if y0 >= z0 then (1,0,0,1,1,0) else              if x0 >= z0 then (1,0,0,1,0,1) else (0,0,1,1,0,1)         else if y0 <  z0 then (0,0,1,0,1,1) else              if x0 <  z0 then (0,1,0,0,1,1) else (0,1,0,1,1,0)     xyz1 = (x0 - int i1 +   g3, y0 - int j1 +   g3, z0 - int k1 +   g3)     xyz2 = (x0 - int i2 + 2*g3, y0 - int j2 + 2*g3, z0 - int k2 + 2*g3)     xyz3 = (x0 - 1      + 3*g3, y0 - 1      + 3*g3, z0 - 1      + 3*g3)     (ii,jj,kk) = (i .&. 255, j .&. 255, k .&. 255)     gi0 = rem (UV.unsafeIndex p (ii +      UV.unsafeIndex p (jj +      UV.unsafeIndex p  kk      ))) 12     gi1 = rem (UV.unsafeIndex p (ii + i1 + UV.unsafeIndex p (jj + j1 + UV.unsafeIndex p (kk + k1)))) 12     gi2 = rem (UV.unsafeIndex p (ii + i2 + UV.unsafeIndex p (jj + j2 + UV.unsafeIndex p (kk + k2)))) 12     gi3 = rem (UV.unsafeIndex p (ii + 1  + UV.unsafeIndex p (jj + 1  + UV.unsafeIndex p (kk + 1 )))) 12     {-# INLINE n #-}     n gi (x',y',z') = (\a -> if a < 0 then 0 else         a*a*a*a*dot3 (UV.unsafeIndex grad3 gi) x' y' z') $ 0.6 - x'*x' - y'*y' - z'*z'  harmonic :: Int -> (Double -> Double) -> Double harmonic octaves noise = f octaves / (2 - 1 / int (2 ^ (octaves - 1))) where     f 0 = 0     f o = let r = 2 ^^ (o - 1) in noise r / r + f (o - 1)  --3D simplex noise --syntax: simplex3D permutation number_of_octaves wavelength x y z simplex3D :: Permutation -> Int -> Double -> Double -> Double -> Double -> Double simplex3D p octaves l x y z = harmonic octaves     (\f -> noise3D p (x * f / l) (y * f / l) (z * f / l)) 

Together with reducing my chunk size to 8x8x128, generating new terrain chunks now occurs at about 10-20 fps, which means moving around is now not nearly as problematic as before. Of course, any other performance improvements are still welcome.

like image 345
FalconNL Avatar asked Apr 18 '11 18:04

FalconNL


1 Answers

The thing that stands out initially is that your code is highly polymorphic. You should specialize your floating point type uniformly to Double, so GHC (and LLVM) have a chance of applying more aggressive optimizations.

Note, for those trying to reproduce, this code imports:

import qualified Data.Vector as V import Data.Bits import Data.Time.Clock import System.Random import System.Random.Shuffle  type Permutation = V.Vector Int 

Ok. There's lots of things you can try to improve this code.

Improvements

Data representation

  • Specialize to a concrete floating point type, instead of polymorphic floating point functions
  • Replace tuple (a,a,a) with unboxed triple T !Double !Double !Double
  • Switch from Data.Array to Data.Array.Unboxed for Permutations
  • Replace use of boxed array of triples with multidimensional unboxed array from repa package

Compiler flags

  • Compile with -O2 -fvia-C -optc-O3 -fexcess-precision -optc-march=native (or equivalent with -fllvm)
  • Increase spec constr threshold -- -fspec-constr-count=16

More efficient library functions

  • Use mersenne-random instead of StdGen to generate randoms
  • Replace mod with rem
  • Replace V.! indexing with unchecked indexing VU.unsafeIndex (after moving to Data.Vector.Unboxed

Runtime settings

  • Increase the default allocation area: -A20M or -H

Also, check your algorithm is identical to the C# one, and you're using the same data structures.

like image 58
Don Stewart Avatar answered Sep 20 '22 07:09

Don Stewart