
Need for fast data demuxing in C# by using multi-threading, AVX, GPU or whatever

I have a pretty simple function that demultiplexes data acquired from a board. The data comes in frames, each frame made up of one sample per signal, as a 1-dim interleaved array, and I need to convert it to jagged arrays, one per signal: from [s0[0], s1[0], …, sN[0], s0[1], s1[1], …] to separate arrays [s0[0], s0[1], …], [s1[0], s1[1], …], and so on.

I'm working in C#, but I have a core function in C that does the work for a single signal:

void Functions::Demux(short* pMux, short* pDemux, int nSignals, int signalIndex, int nValuesPerSignal)
{
    short* pMuxStart = pMux + signalIndex;
    for (int i = 0; i < nValuesPerSignal; i++)
        *pDemux++ = *(pMuxStart + i * nSignals);
}

and then I call it via C++/CLI (using pin_ptr<short>, so no copy) from C# and with parallel for:

Parallel.For(0, nSignals, (int i) =>
{
    Core.Functions.Demux(muxed, demuxed[i], nSignals, i, nFramesPerSignal);
});

The muxed data comes from 16k signals (16-bit resolution), each signal sampled at 20k samples/s, which gives a data rate of 16k * 20k * 2 = 640 MB/s. When running the code on a workstation with 2 Xeon E5-2620 v4 CPUs (16 cores total @ 2.1 GHz), demuxing takes about 115% of real time (10 s of data takes 11.5 s).

I need to get down to at least half that time. Does anybody know of some way, perhaps with AVX technology, or better, of some high-performance library for this? Or is there maybe a way using the GPU? I would prefer not to upgrade the CPU hardware, because that would probably cost more.

Edit: Please consider that nSignals and nValuesPerSignal can change, and that the interleaved array must be split into nSignals separate arrays to be handled further in C#.

Edit: further tests

In the meantime, following the remark from Cody Gray, I tested with a single core by:

void _Functions::_Demux(short* pMux, short** pDemux, int nSignals, int nValuesPerSignal)
{
    for (int i = 0; i < nSignals; i++)
    {
        for (int j = 0; j < nValuesPerSignal; j++)
            pDemux[i][j] = *pMux++;
    }
}

called from C++/CLI:

int nSignals = demuxValues->Length;
int nValuesPerSignal = demuxValues[0]->Length;

pin_ptr<short> pMux = &muxValues[0];

array<GCHandle>^ pins = gcnew array<GCHandle>(nSignals);
for (int i = 0; i < nSignals; i++)
    pins[i] = GCHandle::Alloc(demuxValues[i], GCHandleType::Pinned);

try
{
    array<short*>^ arrays = gcnew array<short*>(nSignals);
    for (int i = 0; i < nSignals; i++)
        arrays[i] = static_cast<short*>(pins[i].AddrOfPinnedObject().ToPointer());

    pin_ptr<short*> pDemux = &arrays[0];

    _Functions::_Demux(pMux, pDemux, nSignals, nValuesPerSignal);
}
finally
{
    for each (GCHandle pin in pins)
        pin.Free();
}

and I obtain a computing time of about 105%, which is still too much, but it clearly shows that Parallel.For is not the right choice. From your replies I guess the only viable solution is SSE/AVX. I never wrote code for that; could some of you point me in the right direction? I think we can assume the processor will always support AVX2.

Edit: my initial code vs Matt Timmermans solution

On my machine I compared my initial code (simply using Parallel.For and calling a C function that deinterlaces a single signal) with the code proposed by Matt Timmermans (still using Parallel.For, but in a cleverer way). See the results (in ms) against the number of tasks used in the Parallel.For (my machine has 32 hardware threads):

N.Tasks   MyCode   MattCode
4         1649     841
8         997      740
16        884      497
32        810      290

So performance is much improved. However, I'll still do some tests on the AVX idea.

asked Dec 14 '16 by Mauro Ganswer

2 Answers

As I mentioned in a comment, you are very likely shooting yourself in the foot here by using Parallel.For. The overhead of multiple threads is far too great for the cost of this simple operation. If you need raw speed so much that you're dropping down to implement this in C++, you shouldn't be using C# at all for anything performance-critical.

Rather, you should be letting the C++ code process multiple signals at a time, in parallel. A good C++ compiler has a far more powerful optimizer than the C# JIT compiler, so it should be able to automatically vectorize the code, allowing you to write something readable yet fast. Compiler switches allow you to easily indicate which instruction sets are available on your target machine(s): SSE2, SSSE3, AVX, AVX2, etc. The compiler will automatically emit the appropriate instructions.
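As a sketch of what "process multiple signals per call" looks like (the function name and signature here are my own, not the asker's exact interface), a plain nested loop is enough for the optimizer to work with:

```cpp
#include <cstddef>

// Demux all signals in one call. Sample j of signal i lives at
// pMux[(size_t)j * nSignals + i]. Kept as a plain loop so the compiler
// can vectorize it when built with e.g. -O2 -mavx2 (MSVC: /O2 /arch:AVX2).
void DemuxAll(const short* pMux, short** pDemux,
              int nSignals, int nValuesPerSignal)
{
    for (int i = 0; i < nSignals; ++i)
    {
        const short* src = pMux + i;   // first sample of signal i
        short* dst = pDemux[i];
        for (int j = 0; j < nValuesPerSignal; ++j)
            dst[j] = src[(size_t)j * nSignals];
    }
}
```

Note that the inner loop reads with stride nSignals, which limits how much the vectorizer can do with the loads; the main wins over the original are dropping the per-signal call and threading overhead.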

If this is still not fast enough, you could consider manually writing the code using intrinsics to cause the desired SIMD instructions to be emitted. It is unclear from your question how variable the input is. Is the number of frames constant? What about the number of values per signal?

Assuming that your input looks exactly like the diagram, you could write the following implementation in C++, taking advantage of the PSHUFB instruction (supported by SSSE3 and later):

static const __m128i mask = _mm_setr_epi8(0, 1,   6,  7,  12, 13,
                                          2, 3,   8,  9,  14, 15,
                                          4, 5,  10, 11);

void Demux(short* pMuxed, short* pDemuxed, size_t count)
{
   for (size_t i = 0; i < count / 8; ++i)
   {
      _mm_store_si128((__m128i*)pDemuxed,
                      _mm_shuffle_epi8(_mm_load_si128((const __m128i*)pMuxed),
                                       mask));
      pMuxed   += 8;
      pDemuxed += 8;
   }
}

In a 128-bit SSE register, we can pack 8 different 16-bit short values. Therefore, inside of the loop, this code loads the next 8 shorts from the input array, re-shuffles them so that they're in the desired order, and then stores the resulting sequence back into the output array. It has to loop enough times to do this for all sets of 8 shorts in the input array, so we do it count / 8 times (assuming count is a multiple of 8).
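To make the reordering concrete, here is a scalar emulation of that one shuffle step (illustration only, not part of the answer's code): output byte k of the 16-byte block is input byte mask[k], so on a little-endian machine a block of shorts 0..7 comes out as 0,3,6,1,4,7,2,5, i.e. the three signals' samples grouped together.

```cpp
// Scalar emulation of the PSHUFB step with the mask defined above.
// Each 16-byte block is permuted bytewise: dst[k] = src[mask[k]].
void Demux8Scalar(const short* in, short* out)
{
    static const unsigned char mask[16] = { 0, 1,   6,  7,  12, 13,
                                            2, 3,   8,  9,  14, 15,
                                            4, 5,  10, 11 };
    const unsigned char* src = reinterpret_cast<const unsigned char*>(in);
    unsigned char* dst = reinterpret_cast<unsigned char*>(out);
    for (int k = 0; k < 16; ++k)
        dst[k] = src[mask[k]];
}
```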

The resulting assembly code is something like the following:

    mov     edx,  DWORD PTR [esp+12]       ; load parameters into registers (count)
    mov     ecx,  DWORD PTR [esp+8]        ; (pMuxed)
    mov     eax,  DWORD PTR [esp+4]        ; (pDemuxed)
    movdqa  xmm0, XMMWORD PTR [mask]       ; load 'mask' into XMM register
    shr     edx,  3                        ; count / 8
    sub     ecx,  eax

 Demux:
    movdqa  xmm1, XMMWORD PTR [ecx+eax]    ; load next 8 shorts from input array
    pshufb  xmm1, xmm0                     ; re-shuffle them
    movdqa  XMMWORD PTR [eax], xmm1        ; store these 8 shorts in output array

    add     eax, 16                        ; increment pointer
    dec     edx                            ; decrement counter...
    jne     Demux                          ;  and keep looping if necessary

(I wrote this code assuming that the input and output arrays are both aligned on 16-byte boundaries, which allows aligned loads and stores to be used. On older processors, this will be faster than unaligned loads; on newer generations of processors, the penalty for unaligned loads is virtually non-existent. This is easy to ensure and enforce in C/C++, but I'm not sure how you are allocating the memory for these arrays in the C# caller. If you control allocation, then you should be able to control alignment. If not, or if you're only targeting later generations of processors that do not penalize unaligned loads, you may alter the code to do unaligned loads instead: use the _mm_storeu_si128 and _mm_loadu_si128 intrinsics, which cause MOVDQU instructions to be emitted instead of MOVDQA.)

There are only 3 SIMD instructions inside of the loop, and the required loop overhead is minimal. This should be relatively fast, although there are almost certainly ways to make it even faster.

One significant optimization would be to avoid repeatedly loading and storing the data. In particular, to avoid storing the output into an output array. Depending on what you're going to do with the demuxed output, it would be more efficient to just leave it in an SSE register and work with it there. However, this won't interop well (if at all) with managed code, so you are very constrained if you must pass the results back to a C# caller.

To write truly efficient SIMD code, you want to have a high computation to load/store ratio. In other words, you want to do a lot of manipulation on the data in between the loads and stores. Here, you are only doing one operation (the shuffle) on the data in between the load and the store. Unfortunately, there is no way around that, unless you can interleave the subsequent "processing" code. Demuxing requires only one operation, meaning that your bottleneck will inevitably be the time it takes to read the input and write the output.

Another possible optimization would be manually unrolling the loop. But this introduces a number of potential complications and requires you to know something about the nature of your input. If the input arrays are generally short, unrolling doesn't make sense. If they are sometimes short and sometimes long, unrolling still may not make sense, because you'll have to explicitly deal with the case where the input array is short, breaking out of the loop early. If the input arrays are always rather long, then unrolling may be a performance win. Not necessarily, though; as mentioned above, the loop overhead here is pretty minimal.

If you need to parameterize based on the number of frames and number of values per signal, you will very likely have to write multiple routines. Or, at the very least, a variety of different masks. This will significantly increase the complexity of the code, and thus the maintenance cost (and potentially also the performance, since the instructions and data you need are less likely to be in the cache), so unless you can really do something that is significantly more optimal than a C++ compiler, you should consider letting the compiler generate the code.

answered Oct 22 '22 by Cody Gray


Good news! You don't need multiple cores, SIMD, or fancy packages to solve this problem. You probably don't even need to call out to C.

Your bottleneck is memory bandwidth, because you're using it inefficiently.

With that CPU, your memory is probably fast enough to demux > 3GB/s using one core, but each time you need a sample from RAM, the CPU fetches 64 bytes to fill a cache line, and you only use 2 of them. Those 64 bytes will hang around in cache for a while and some other threads will probably use some of them, but the access pattern is still really bad.

All you really have to do is make good use of those 64 bytes. There are many ways. For example:

1) Try a simple loop in C#. Run through your input buffer from start to end, putting each sample you encounter where it belongs. This will use all 64 bytes every time you fill a cache line in reading, and your 16K output channels are few enough that the blocks you're writing to will mostly stay cached. This will probably be fast enough.

2) Continue calling out to your C function, but process the input buffer in 2MB chunks, and don't bother with multiple cores. Each of those 2MB chunks is small enough to stay cached until you're done with it. This will probably be a bit faster than (1).

3) If the above aren't fast enough (it may be close), then you can go multithreaded. Use method (2), but do a Parallel.For over the chunks. That way each core can do a whole 2MB chunk, making good use of its cache, and the threads won't contend with each other. Use at most 4 threads, or you may start stressing your cache again. If you really need more than 4 threads, then divide the work more finely, into groups of 1024 channels within each 2MB block or so... but you won't need to do that.
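Option (2) can also be sketched on the native side, if the asker prefers to keep the C call (names are mine; framesPerChunk would be roughly 2 MB divided by the frame size, about 64 frames for 16k signals):

```cpp
#include <algorithm>
#include <cstddef>

// Cache-blocked demux: walk the muxed buffer a chunk of frames at a
// time, and copy out every signal's samples for that chunk before
// moving on. Each chunk stays hot in cache while all nSignals output
// runs are written, so every fetched cache line is fully used.
void DemuxChunked(const short* pMux, short** pDemux,
                  int nSignals, int nValuesPerSignal, int framesPerChunk)
{
    for (int block = 0; block < nValuesPerSignal; block += framesPerChunk)
    {
        int blockLen = std::min(framesPerChunk, nValuesPerSignal - block);
        for (int sig = 0; sig < nSignals; ++sig)
        {
            const short* s = pMux + (size_t)block * nSignals + sig;
            short* d = pDemux[sig] + block;
            for (int j = 0; j < blockLen; ++j)
            {
                d[j] = *s;
                s += nSignals;
            }
        }
    }
}
```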

EDIT:

Oh, sorry -- option (1) is pretty difficult to implement in unsafe C# because each fixed statement only fixes a few pointers, and using managed arrays is too slow. Option (2) is easy in unsafe C#, though, and still works fine. I wrote a test:

public static unsafe void doit(short[] inarray, short[][] demux, int nSignals, int nSamples)
{
    fixed (short *pin=inarray)
    {
        for(int block=0; block < nSamples; block+=64)
        {
            for(int sig=0; sig<nSignals; ++sig)
            {
                fixed(short *psig = demux[sig])
                {
                    short* s = pin + block * nSignals + sig;
                    short* d = psig + block;
                    short* e = d + Math.Min(64, nSamples - block);
                    while(d<e)
                    {
                        *d++ = *s;
                        s += nSignals;
                    }
                }
            }
        }
    }
}
public static void Main()
{
    int nSignals = 16384;
    int nSamples = 20000;

    short[][] demux = new short[nSignals][];
    for (int i = 0; i < demux.Length; ++i)
    {
        demux[i] = new short[nSamples];
    }
    short[] mux = new short[nSignals * nSamples];
    //warm up
    doit(mux, demux, nSignals, nSamples);
    doit(mux, demux, nSignals, nSamples);
    doit(mux, demux, nSignals, nSamples);
    //time it
    var watch = System.Diagnostics.Stopwatch.StartNew();
    doit(mux, demux, nSignals, nSamples);
    watch.Stop();
    Console.WriteLine("Time (ms): " + watch.ElapsedMilliseconds);
}

That's one second of data, and on my box it says:

Time (ms): 724

Hmm... that's better than real time, but not twice as fast as real time. Your box looks to be a little faster than mine, so maybe it's OK. Let's try the parallel version (3). The Main function is the same:

public static unsafe void dopart(short[] inarray, short[][] demux, int offset, int nSignals, int nSamples)
{
    fixed (short* pin = inarray)
    {
        for (int block = 0; block < nSamples; block += 64)
        {
            for (int sig = 0; sig < nSignals; ++sig)
            {
                fixed (short* psig = demux[sig])
                {
                    short* s = pin + (offset + block) * nSignals + sig;
                    short* d = psig + offset + block;
                    short* e = d + Math.Min(64, nSamples - block);
                    while (d < e)
                    {
                        *d++ = *s;
                        s += nSignals;
                    }
                }
            }
        }
    }
}
public static unsafe void doit(short[] inarray, short[][] demux, int nSignals, int nSamples)
{
    int steps = (nSamples + 1023) / 1024;
    ParallelOptions options = new ParallelOptions();
    options.MaxDegreeOfParallelism = 4;
    Parallel.For(0, steps, options, step => {
        int offset = (int)(step * 1024);
        int size = Math.Min(1024, nSamples - offset);
        dopart(inarray, demux, offset, nSignals, size);
    });
}

That's better:

Time (ms): 283

The amount of data being read and written in this case is ~ 4.6 GB/s, which is a bit off my theoretical maximum of 6.4 GB/s, and I only have 4 real cores, so I might be able to get that down a bit by calling out to C, but there's not a lot of room for improvement.

answered Oct 22 '22 by Matt Timmermans