Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Numerous short lived tasks slower when multithreaded even with threadpool

Background

I currently have a linear physics engine (but this question does not require physics engine knowledge), parts of which I'm experimenting with multithreading with the hope of increasing efficiency.

One such section is the broad phase, in this case this involves moving through all the objects along all 3 axes to check which overlap (any which occur on all axes are considered to have collided in broad phase). The 3 axis sweeps are, apart from using common objects, completely independent and so seemed a good place for multithreading.

In order to avoid the possibility of blocking between threads each of these 3 processes takes a local copy of all data it wants to use before (if applicable) going multithreaded

While these sweeps are a significant bottle neck they are very short lived, a sweep typically lasts 1-4ms. This is a real time application where the code is run 60 times a second so total tick time is at maximum 17ms so 1-4ms is a long time for me. Because these sweeps are short lived I have used a thread pool. Specifically a Executors.newFixedThreadPool(3), 3 for the 3 axes.

My test computer was a dual core with hyper threading, so up to 4 threads should be comfortable. Checked using Runtime.getRuntime().availableProcessors();

Question

When running the following test code, in which numerous short lived tasks were run either single threaded or multithreaded using a threadpool the multithreaded version was much slower; see profile data. This was the case even when the multithreaded parts had no objects in common. Why was this and is there any way to run many short lived (1-4ms) tasks concurrently?

Even making the tasks much much larger only makes the multithreaded version approach the single threaded in performance not exceed it as I had expected, which makes me think i'm doing something seriously wrong.

enter image description here

Test code

public class BroadPhaseAxisSweep implements Callable<Set<PotentialCollisionPrecursor>>  {

    static final int XAXIS=0;
    static final int YAXIS=1;
    static final int ZAXIS=2;

    int axis; 
    int[] axisIndicies;
    boolean[] isStatic;
    boolean[] isLightWeight; 
    boolean[] isCollidable; 

    //orders the same as axisIndicies
    double[] starts;
    double[] ends;

    private static ExecutorService sweepPool = Executors.newFixedThreadPool(3);

    public BroadPhaseAxisSweep(int axis, List<TestObject> allObjects) {
        //all data that will be used by the thread is cached internally to avoid 
        //any concurrent access issues

        this.axis = axis;

        //allObjects is in reality unsorted, axisIndicies holds sorted indices
        //in this case allObjects just "happens" to be already sorted
        this.axisIndicies =new int[allObjects.size()];
        for(int i=0;i<allObjects.size();i++){
            axisIndicies[i]=i;
        }
        isStatic=new boolean[allObjects.size()];
        for(int i=0;i<allObjects.size();i++){
            isStatic[i]=allObjects.get(i).isStatic();
        }
        isLightWeight=new boolean[allObjects.size()];
        for(int i=0;i<allObjects.size();i++){
            isLightWeight[i]=allObjects.get(i).isLightWeightPhysicsObject();
        }
        isCollidable=new boolean[allObjects.size()];
        for(int i=0;i<allObjects.size();i++){
            isCollidable[i]=allObjects.get(i).isCollidable();
        }

        starts=new double[allObjects.size()];
        for(int i=0;i<allObjects.size();i++){
            starts[i]=allObjects.get(i).getStartPoint();
        }
        ends=new double[allObjects.size()];
        for(int i=0;i<allObjects.size();i++){
            ends[i]=allObjects.get(i).getEndPoint();
        }
    }


    @Override
    public Set<PotentialCollisionPrecursor> call() throws Exception {
        return axisSweep_simple(axisIndicies);
    }

    private Set<PotentialCollisionPrecursor> axisSweep_simple(int[] axisIndicies){

        Set<PotentialCollisionPrecursor> thisSweep =new HashSet();


        for(int i=0;i<starts.length;i++){
            if (isCollidable[axisIndicies[i]]){
                double activeObjectEnd=ends[i];
                //sweep forwards until an objects start is before out end
                for(int j=i+1;j<starts.length;j++){
                    //j<startXsIndicies.length is the bare mininmum contrain, most js wont get that far
                    if ((isStatic[axisIndicies[i]]&& isStatic[axisIndicies[j]]) || ((isLightWeight[axisIndicies[i]]&& isLightWeight[axisIndicies[j]]))){
                        //if both objects are static or both are light weight then they cannot by definition collide, we can skip
                        continue;
                    }


                    if (activeObjectEnd>starts[j]){
                        PotentialCollisionPrecursor potentialCollision=new PotentialCollisionPrecursor(getObjectNumberFromAxisNumber(i),getObjectNumberFromAxisNumber(j));
                            thisSweep.add(potentialCollision);
                    }else{
                        break; //this is as far as this active object goes

                    }

                }
            }
        }

        return thisSweep;
    }


    private int getObjectNumberFromAxisNumber(int number){
        return axisIndicies[number];
    }


     public static void main(String[] args){
         int noOfObjectsUnderTest=250;

         List<TestObject> testObjects=new ArrayList<>();

         Random rnd=new Random();
         double runningStartPosition=0;
         for(int i=0;i<noOfObjectsUnderTest;i++){
             runningStartPosition+=rnd.nextDouble()*0.01;
             testObjects.add(new TestObject(runningStartPosition));
         }

         while(true){
             runSingleTreaded(testObjects);
             runMultiThreadedTreaded(testObjects);
         }

     }

    private static void runSingleTreaded(List<TestObject> testObjects) {
        try {
            //XAXIS used over and over again just for test
            Set<PotentialCollisionPrecursor> xSweep=(new BroadPhaseAxisSweep(XAXIS,testObjects)).call();
            Set<PotentialCollisionPrecursor> ySweep=(new BroadPhaseAxisSweep(XAXIS,testObjects)).call();
            Set<PotentialCollisionPrecursor> zSweep=(new BroadPhaseAxisSweep(XAXIS,testObjects)).call();

            System.out.println(xSweep.size()); //just so JIT can't possibly optimise out
            System.out.println(ySweep.size()); //just so JIT can't possibly optimise out
            System.out.println(zSweep.size()); //just so JIT can't possibly optimise out
        } catch (Exception ex) {
            //bad practice, example only
            Logger.getLogger(BroadPhaseAxisSweep.class.getName()).log(Level.SEVERE, null, ex);
        }
    }

    private static void runMultiThreadedTreaded(List<TestObject> testObjects) {
        try {
            //XAXIS used over and over again just for test
            Future<Set<PotentialCollisionPrecursor>> futureX=sweepPool.submit(new BroadPhaseAxisSweep(XAXIS,testObjects));
            Future<Set<PotentialCollisionPrecursor>> futureY=sweepPool.submit(new BroadPhaseAxisSweep(XAXIS,testObjects));
            Future<Set<PotentialCollisionPrecursor>> futureZ=sweepPool.submit(new BroadPhaseAxisSweep(XAXIS,testObjects));

            Set<PotentialCollisionPrecursor> xSweep=futureX.get();
            Set<PotentialCollisionPrecursor> ySweep=futureY.get();
            Set<PotentialCollisionPrecursor> zSweep=futureZ.get();

            System.out.println(xSweep.size()); //just so JIT can't possibly optimise out
            System.out.println(ySweep.size()); //just so JIT can't possibly optimise out
            System.out.println(zSweep.size()); //just so JIT can't possibly optimise out
        } catch (Exception ex) {
            //bad practice, example only
            Logger.getLogger(BroadPhaseAxisSweep.class.getName()).log(Level.SEVERE, null, ex);
        }
    }


    public static class TestObject{

        final boolean isStatic;
        final boolean isLightWeight;
        final boolean isCollidable;
        final double startPointOnAxis;
        final double endPointOnAxis; 

        public TestObject(double startPointOnAxis) {
            Random rnd=new Random();
            this.isStatic = rnd.nextBoolean();
            this.isLightWeight =  rnd.nextBoolean();
            this.isCollidable =  rnd.nextBoolean();
            this.startPointOnAxis = startPointOnAxis;
            this.endPointOnAxis =startPointOnAxis+0.2*rnd.nextDouble();
        }

        public boolean isStatic() {
            return isStatic;
        }

        public boolean isLightWeightPhysicsObject() {
            return isLightWeight;
        }

        public boolean isCollidable() {
            return isCollidable;
        }

        public double getStartPoint() {
            return startPointOnAxis;
        }

        public double getEndPoint() {
            return endPointOnAxis;
        }
    }

}

public class PotentialCollisionPrecursor {
    //holds the object numbers of a potential collision, can be converted to a real PotentialCollision using a list of those objects
    private final int rigidBodyNumber1;
    private final int rigidBodyNumber2; 


    public PotentialCollisionPrecursor(int rigidBodyNumber1, int rigidBodyNumber2) {
        if (rigidBodyNumber1<rigidBodyNumber2){
            this.rigidBodyNumber1 = rigidBodyNumber1;
            this.rigidBodyNumber2 = rigidBodyNumber2;
        }else{
            this.rigidBodyNumber1 = rigidBodyNumber2;
            this.rigidBodyNumber2 = rigidBodyNumber1;
        }
    }

    public int getRigidBodyNumber1() {
        return rigidBodyNumber1;
    }

    public int getRigidBodyNumber2() {
        return rigidBodyNumber2;
    }

    @Override
    public int hashCode() {
        int hash = 7;
        hash = 67 * hash + this.rigidBodyNumber1;
        hash = 67 * hash + this.rigidBodyNumber2;
        return hash;
    }

    @Override
    public boolean equals(Object obj) {
        if (obj == null) {
            return false;
        }
        if (getClass() != obj.getClass()) {
            return false;
        }
        final PotentialCollisionPrecursor other = (PotentialCollisionPrecursor) obj;
        if (this.rigidBodyNumber1 != other.rigidBodyNumber1) {
            return false;
        }
        if (this.rigidBodyNumber2 != other.rigidBodyNumber2) {
            return false;
        }
        return true;
    }

}

ThreadPools of differing sizes

After single threaded the next fastest is a pool of 2/3 threads, then the slowest of all is a single thread in a thread pool (unsurprisingly as that has all the overhead and none of the gains)

enter image description here

Unnaturally large tasks size

To test if the problem was simply having too small tasks for threading I increased the task size to around 100ms. These gave even more confusing results; any number of thread between 1 and 3 is approximately the same speed and is slower than single threaded

enter image description here

like image 606
Richard Tingle Avatar asked Oct 02 '22 18:10

Richard Tingle


2 Answers

If your broad sweeps are only taking a few milliseconds, you're probably better off doing everything synchronously. The effort required to reserve a thread pool thread (on Windows) takes longer than 5ms. Not to mention that you still have to move data around between threads and wait on context switches and then finally put the threads back where you found them.

This whole process is likely to just degrade performance, especially since you're taking local copies of the data. If each sweep was naturally independent and took upwards of 500ms, you could probably benefit from some concurrency model like you have implemented.

One interesting thing to note is that graphics processors these days come with embedded co-processors dedicated to physics calculations. The reason they're so good at doing things like this is because they have sometimes thousands of processor cores all running at relatively low clock rates. This means that they're very good for taking on lots of small-ish tasks concurrently. You may want to try interfacing with a graphics processor directly to offload the physics processing into that kind of environment rather than work with it on a general CPU.

like image 70
Michael J. Gray Avatar answered Oct 13 '22 12:10

Michael J. Gray


I promised to summarize all the findings here... it's rather confusing as the main culprit is the profiler which slows down the non main threads disproportionately to the main thread. Maybe it uses atomic counters for tracking it's data and maybe their's overhead is so high it leads to such non-sensible results.

Measuring the time manually gives better results, namely a multi-threading speedup of 30-40 %. This makes sense as there's big sequential overhead due to data copying.

This copying is neither necessary nor useful. It's unnecessary as all threads only read shared data. It's useless as reading shared variables is no slower than reading an own copy thereof:

  • the cores are capable of getting the data from each other's cache pretty fast
  • they place a copy of them into their local L1 and L2 caches (the "shared" state of the MESI protocol)
  • the L3 cache is shared, which means that needlessly copied data imply more L3 misses due to higher memory footprint
like image 24
maaartinus Avatar answered Oct 13 '22 10:10

maaartinus