Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

My N-body program runs 100x slower when coded in Julia vs. when coded in C, why?

I was writing a direct nbody routine in julia, and portability issues made me rewrite it in C. After writing it, I was surprised by the speedup, I was expecting even an order of magnitude but not two!

I was wondering if it's normal such an speedup with only a rewriting in C, being Julia so focused on speed and HPC.

This are the codes, I have simplified them to make them concise preserving the speedup of C (all masses are 1, Force is just the distance of the two bodies).

The loop iterates over every index (The star propierties array is fixed size, but I only use the first 400 for the test) and computes the contribution of the rest of the indexes, then uses an Euler integrator to compute the new position (new velocity += F/m times dt, new position += velocity times dt).

C code, compiled with gcc and no special flags, time ./a.out gives 0.98s:

#include <stdio.h>
#include <stdlib.h>

// Array of stars is fixed size. It's initialized to a maximum size
// and only the needed portion it's used.
#define MAX_STAR_N (int)5e5

double *x,*y,*z,*vx,*vy,*vz;    

void evolve_bruteforce(double dt){
    // Compute forces and integrate the system with an Euler
    int i,j;
    for(i=0;i<400;i++){
        double cacheforce[3] = {0,0,0};
        double thisforce[3];
        for(j=0;j<400;j++){
            if(i!=j){
                thisforce[0] = (x[j] - x[i]);
                thisforce[1] = (y[j] - y[i]);
                thisforce[2] = (z[j] - z[i]);
                cacheforce[0] += thisforce[0];
                cacheforce[1] += thisforce[1];
                cacheforce[2] += thisforce[2];
            }
         }
        vx[i] += cacheforce[0]*dt;
        vy[i] += cacheforce[1]*dt;
        vz[i] += cacheforce[2]*dt;
    }
    for(i=0;i<400;i++){
       x[i] += vx[i]*dt;
       y[i] += vy[i]*dt;
       z[i] += vz[i]*dt;
    }
}



int main (int argc, char *argv[]){
    // Malloc all the arrays needed
    x   = malloc(sizeof(double)*MAX_STAR_N);
    y   = malloc(sizeof(double)*MAX_STAR_N);
    z   = malloc(sizeof(double)*MAX_STAR_N);
    vx  = malloc(sizeof(double)*MAX_STAR_N);
    vy  = malloc(sizeof(double)*MAX_STAR_N);
    vz  = malloc(sizeof(double)*MAX_STAR_N);

    int i;
    for(i=0;i<1000;i++)
    {
        evolve_bruteforce(0.001);
    }
}

Julia code, executed with julia -O --check-bounds=no, gives 102 seconds:

function evolve_bruteforce(dt,x,y,z,vx,vy,vz)
    for i in 1:400
        cacheforce = [0.0,0.0,0.0]
        thisforce = Vector{Float64}(3)
        for j in 1:400
            if i != j
                thisforce[1] = (x[j] - x[i])
                thisforce[2] = (y[j] - y[i])
                thisforce[3] = (z[j] - z[i])
                cacheforce[1] += thisforce[1]
                cacheforce[2] += thisforce[2]
                cacheforce[3] += thisforce[3]
                vx[i] += cacheforce[1]*dt
                vy[i] += cacheforce[2]*dt
                vz[i] += cacheforce[3]*dt
            end
            for i in 1:400
                x[i] += vx[i]*dt
                y[i] += vy[i]*dt
                z[i] += vz[i]*dt
            end
        end
    end
end



function main()
    x = zeros(500000)
    y = zeros(500000)
    z = zeros(500000)
    vx = zeros(500000)
    vy = zeros(500000)
    vz = zeros(500000)
    @time for i in 1:1000
        evolve_bruteforce(0.001,x,y,z,vx,vy,vz)
    end
end

main()

I don't know how I can make this easier to answer out, if I can modify the post in any way please, let me know.

like image 747
RedPointyJackson Avatar asked Dec 01 '22 13:12

RedPointyJackson


2 Answers

As pointed out in the comments, the julia code is not equivalent to the C code. In the julia code the second for i in 1:400 is inside instead of after the first for loop. The code inside the if statement is also not the same.

Below is a version of evolve_bruteforce that matches the C code better:

function evolve_bruteforce(dt,x,y,z,vx,vy,vz)
    for i in 1:400
        cacheforce = [0.0,0.0,0.0]
        thisforce = Vector{Float64}(3)
        for j in 1:400
            if i != j
                thisforce[1] = (x[j] - x[i])
                thisforce[2] = (y[j] - y[i])
                thisforce[3] = (z[j] - z[i])
                cacheforce[1] += thisforce[1]
                cacheforce[2] += thisforce[2]
                cacheforce[3] += thisforce[3]
            end
        end
        # this bit was inside the if statement
        vx[i] += cacheforce[1]*dt
        vy[i] += cacheforce[2]*dt
        vz[i] += cacheforce[3]*dt
    end
    # this loop was nested inside the first one
    for i in 1:400
        x[i] += vx[i]*dt
        y[i] += vy[i]*dt
        z[i] += vz[i]*dt
    end
end

It should be noted that the benchmarks in this answer are quite naïve and potentially unfair, there are a lot of different language- and compiler specific optimization that can be used boost performance.

The Julia code above gives execution times of roughly 2.2 and 1.7 seconds:

# without any flags
2.188550 seconds (800.00 k allocations: 61.035 MB, 0.19% gc time)
2.199045 seconds (800.00 k allocations: 61.035 MB, 0.15% gc time)
2.194662 seconds (800.00 k allocations: 61.035 MB, 0.15% gc time)
# using the flags in the question: julia -O --check-bounds=on
1.688692 seconds (800.00 k allocations: 61.035 MB, 0.19% gc time)
1.705764 seconds (800.00 k allocations: 61.035 MB, 0.19% gc time)
1.688692 seconds (800.00 k allocations: 61.035 MB, 0.19% gc time)

On the same laptop the execution times for the C code in posted in the question are roughly 1.6 and 0.6 seconds:

# gcc without any flags
1.568s
1.585s
1.592s
# using gcc -Ofast
0.620s
0.594s
0.568s
like image 58
jarmokivekas Avatar answered Dec 09 '22 16:12

jarmokivekas


When using hard-coded 3 dimensional code, using Tuple type instead of Array is more appropriate (there will be no appending of additional physical dimensions in the middle of the simulation - even when doing superstring theory).

Rewriting @jarmokivekas's evolve_bruteforce like this:

function evolve_bruteforce(dt,x,y,z,vx,vy,vz)
    for i in 1:400
        cacheforce = (0.0,0.0,0.0)
        thisforce = (0.0,0.0,0.0)
        for j in 1:400
            if i != j
                thisforce = ((x[j] - x[i]),(y[j] - y[i]),(z[j] - z[i]))
                cacheforce = (cacheforce[1]+thisforce[1],
                              cacheforce[2]+thisforce[2],
                              cacheforce[3]+thisforce[3])
            end
        end
        # this bit was inside the if statement
        (vx[i],vy[i],vz[i]) = (vx[i]+cacheforce[1]*dt,
                               vy[i]+cacheforce[2]*dt,
                               vz[i]+cacheforce[3]*dt)
    end
    # this loop was nested inside the first one
    for i in 1:400
         (x[i],y[i],z[i]) = (x[i]+vx[i]*dt,y[i]+vy[i]*dt,z[i]+vz[i]*dt)
    end
end

This gives another 2x speedup (from 1.1sec to 0.5sec on this machine).

like image 32
Dan Getz Avatar answered Dec 09 '22 15:12

Dan Getz