 

Improving "shuffle" efficiency

Right now, I am using the following code to create a Shuffle extension:

public static class SiteItemExtensions
{
    // In-place Fisher-Yates shuffle of the list.
    public static void Shuffle<T>(this IList<T> list)
    {
        var rng = new Random();
        int n = list.Count;
        while (n > 1)
        {
            n--;
            // Pick a random index in [0, n] and swap that element with the one at n.
            int k = rng.Next(n + 1);
            T value = list[k];
            list[k] = list[n];
            list[n] = value;
        }
    }
}

I am looking for a faster and more efficient way to do this. Right now, measuring with the Stopwatch class, it takes about 20 seconds to shuffle 100,000,000 items. Does anyone have any ideas to make this faster?
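For context, a minimal sketch of how such a timing can be taken with Stopwatch; the list contents and size here are illustrative assumptions, and it assumes the Shuffle extension above is in scope:

public class Program
{
    public static void Main()
    {
        // Assumed setup: a list of 100,000,000 ints, as described above.
        var list = System.Linq.Enumerable.Range(0, 100000000).ToList();

        var sw = System.Diagnostics.Stopwatch.StartNew();
        list.Shuffle();   // the extension method shown above
        sw.Stop();

        System.Console.WriteLine($"Shuffle took {sw.Elapsed.TotalSeconds:F1} s");
    }
}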

asked Sep 30 '11 by Icemanind



1 Answer

This highlights an aspect of modern computer design that's often overlooked. It can be made over 3 times faster by a silly change:

            int k = 0; rng.Next(n + 1);  // silly change

There are now more statements in the inner loop, but it is much faster. What you are seeing is the effect of the CPU cache. This algorithm has very poor cache locality: the odds that the next element to be read from the array is already in the cache are very low, so each access takes an expensive trip to the slower outer caches and the dead-slow memory bus. The odds that elements needed later get loaded into the cache along the way are very high, but the odds that they are still there when they are finally needed are very low; your list is simply too large to fit in the cache.

Nothing can be done to fix this; it is inherent in the algorithm's design. Using realistic list sizes is, however, an obvious workaround: running the shuffle 100,000 times on a list with 1,000 elements is three times as fast. A rough sketch of that comparison follows (the sizes and setup are illustrative assumptions, and it assumes the Shuffle extension from the question is in scope):
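using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

public class CacheDemo
{
    public static void Main()
    {
        // One huge list: random accesses miss the cache almost every time.
        var big = Enumerable.Range(0, 100000000).ToList();
        var sw = Stopwatch.StartNew();
        big.Shuffle();
        Console.WriteLine($"1 x 100,000,000 elements: {sw.Elapsed.TotalSeconds:F1} s");

        // Many small lists: a 1,000-element list fits comfortably in the CPU cache.
        var small = Enumerable.Range(0, 1000).ToList();
        sw.Restart();
        for (int i = 0; i < 100000; i++)
            small.Shuffle();
        Console.WriteLine($"100,000 x 1,000 elements: {sw.Elapsed.TotalSeconds:F1} s");
    }
}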

answered Oct 04 '22 by Hans Passant