
Segmented Aggregation within an Array

I have a large array of primitive value types. The array is in fact one-dimensional, but logically represents a 2-dimensional field. Reading from left to right, each value needs to become (the original value of the current cell) + (the result calculated in the cell to the left), with the exception of the first element of each row, which keeps its original value.

I already have an implementation that accomplishes this, but it iterates sequentially over the entire array and is extremely slow for large (1M+ element) arrays.

Given the following example array,

0 0 1 0 0
2 0 0 0 3
0 4 1 1 0
0 1 0 4 1

Becomes

0 0 1 1 1
2 2 2 2 5
0 4 5 6 6
0 1 1 5 6

And so forth to the right, up to problematic sizes (1024x1024)

The array needs to be updated (ideally), but another array can be used if necessary. Memory footprint isn't much of an issue here, but performance is critical as these arrays have millions of elements and must be processed hundreds of times per second.
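For reference, the operation described is a running sum that restarts at each row. A naive sequential version (a sketch of the kind of loop currently in use, not the actual code; the names data, width and height are assumed) looks roughly like this:

// Row-by-row running sum over a flat array laid out row-major.
static void RowPrefixSums(int[] data, int width, int height)
{
    for (int row = 0; row < height; row++)
    {
        int sum = 0;
        for (int col = 0; col < width; col++)
        {
            int i = row * width + col;
            sum += data[i];   // accumulate from the left
            data[i] = sum;    // first element of each row keeps its original value (sum == data[i] there)
        }
    }
}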

The individual cell calculations do not appear to be parallelizable given their dependence on values starting from the left, so GPU acceleration seems impossible. I have investigated PLINQ, but the requirement to track indices makes it very difficult to apply here.

Is there another way to structure the data to make it faster to process?

If efficient GPU processing is feasible using an innovative technique, this would be vastly preferable, as this is currently texture data that has to be pulled from and pushed back to the video card.

asked Dec 10 '14 by selkathguy


2 Answers

Proper coding and a bit of insight into how .NET works help here as well :-)

Some rules of thumb that apply in this case:

  1. If you can hint the JIT that the indexing will never go out of bounds of the array, it will remove the extra bounds-check branch.
  2. Only parallelize it across multiple threads if it's really slow (e.g. >1 second per pass). Otherwise task switching, cache flushes, etc. will probably just eat up the added speed and you'll end up worse off.
  3. If possible, make memory access predictable, preferably sequential. If you need another array, so be it - but if you can work in place, prefer that.
  4. Use as few IL instructions as possible if you want speed. Generally this seems to work.
  5. Test multiple iterations. A single iteration might not be good enough.

Using these rules, you can make a small test case as follows. Note that I've upped the stakes to 4K x 4K, since 1K x 1K is so fast you can hardly measure it :-)

using System;
using System.Diagnostics;

public static void Main(string[] args)
{
    int width = 4096;
    int height = 4096;

    int[] ar = new int[width * height];
    Random rnd = new Random(213);
    for (int i = 0; i < ar.Length; ++i)
    {
        ar[i] = rnd.Next(0, 120);
    }

    // (5)...
    for (int j = 0; j < 10; ++j)
    {
        Stopwatch sw = Stopwatch.StartNew();

        int sum = 0;
        for (int i = 0; i < ar.Length; ++i)  // (3) sequential access
        {
            if ((i % width) == 0)
            {
                sum = 0;
            }

            // (1) --> the JIT will notice this won't go out of bounds because [0<=i<ar.Length]
            // (5) --> '+=' is an expression generating a 'dup'; this creates less IL.
            ar[i] = (sum += ar[i]); 
        }

        Console.WriteLine("This took {0:0.0000}s", sw.Elapsed.TotalSeconds);
    }
    Console.ReadLine();
}

One of these iterations will take roughly 0.0174 s here, and since this is about 16x the worst-case size you describe, I suppose your performance problem is solved.

If you really want to parallelize it to make it faster, I suppose that is possible, even though you will lose some of the JIT optimizations (specifically (1)). However, if you have a multi-core system like most people, the benefits might outweigh that:

for (int j = 0; j < 10; ++j)
{
    Stopwatch sw = Stopwatch.StartNew();
    Parallel.For(0, height, (a) =>
    {
        int sum = 0;
        for (var i = width * a; i < width * (a + 1); i++) // start at the row's first element so it is included in the running sum
        {
            ar[i] = (sum += ar[i]);
        }
    });
    Console.WriteLine("This took {0:0.0000}s", sw.Elapsed.TotalSeconds);
}

If you really, really need performance, you can compile it to C++ and use P/Invoke. Even if you don't use the GPU, I suppose the SSE/AVX instructions might already give you a significant performance boost that you won't get with .NET/C#. I'd also like to point out that the Intel C++ compiler can automatically vectorize your code - even for the Xeon Phi. Without a lot of effort, this might give you a nice boost in performance.
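If you go that route, the C# side only needs a P/Invoke declaration for the native routine. A minimal sketch, assuming a hypothetical native library called rowsum.dll exporting a RowPrefixSum function (the name and signature are assumptions, not an existing library):

using System.Runtime.InteropServices;

static class Native
{
    // Hypothetical native export; the int[] marshals as a plain int* on the C++ side.
    [DllImport("rowsum.dll", CallingConvention = CallingConvention.Cdecl)]
    public static extern void RowPrefixSum(int[] data, int width, int height);
}

// Usage: Native.RowPrefixSum(ar, width, height);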

answered by atlaste


Well, I don't know too much about GPUs, but I see no reason why you can't parallelize it, as the dependencies only run from left to right within a row.

There are no dependencies between rows.

0 0 1 0 0  - process on core1 |
2 0 0 0 3  - process on core1 |
-------------------------------
0 4 1 1 0  - process on core2 |
0 1 0 4 1  - process on core2 |

Although the above statement is not completely true: there are still hidden dependencies between rows when it comes to the memory cache.

It's possible that there will be cache thrashing. You can read about "false sharing" to understand the problem and see how to overcome it.
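One common way to reduce that risk (a sketch, not measured against this workload) is to hand each worker a contiguous block of rows instead of letting the scheduler interleave single rows, for example with a range partitioner; this reuses the ar, width and height variables from the code above:

using System.Collections.Concurrent;
using System.Threading.Tasks;

// Each worker processes a contiguous block of rows, so threads touch disjoint regions of the array.
var ranges = Partitioner.Create(0, height);
Parallel.ForEach(ranges, range =>
{
    for (int row = range.Item1; row < range.Item2; row++)
    {
        int sum = 0;
        for (int i = row * width; i < (row + 1) * width; i++)
        {
            ar[i] = (sum += ar[i]);
        }
    }
});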

answered by Erti-Chris Eelmaa