
Nested loop traversing arrays

There are two very big series of elements, the second about 100 times bigger than the first. For each element of the first series, there are 0 or more matching elements in the second series. This can be traversed and processed with two nested loops, but the unpredictable number of matching elements for each member of the first series makes things very, very slow.

The actual processing of the second series of elements involves a bitwise AND (&) and a population count.

I couldn't find good optimizations in C, but I am considering inline asm, doing rep* mov* or similar for each element of the first series and then batch-processing the matching bytes of the second series, perhaps in buffers of 1MB or so. But the code would get quite messy.

Does anybody know of a better way? C preferred but x86 ASM OK too. Many thanks!

Sample/demo code with a simplified problem; for clarity's sake the first series is "people" and the second is "events". (The original problem actually has 100m and 10,000m entries!)

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>  // PRIu64, to print uint64_t portably

#define PEOPLE 1000000    //   1m
struct Person {
    uint8_t age;   // Filtering condition
    uint8_t cnt;   // Number of events for this person in E
} P[PEOPLE]; // Each has 0 or more bytes with bit flags

#define EVENTS 100000000  // 100m
uint8_t P1[EVENTS]; // Property 1 flags
uint8_t P2[EVENTS]; // Property 2 flags

void init_arrays() {
    for (int i = 0; i < PEOPLE; i++) { // just some stuff
        P[i].age = i & 0x07;
        P[i].cnt = i % 200; // assert( sum < EVENTS ); with i % 220 the average was ~110 events/person and overran the arrays
    }
    for (int i = 0; i < EVENTS; i++) {
        P1[i]    = i % 7;  // just some stuff
        P2[i]    = i % 9;  // just some other stuff
    }
}

int main(int argc, char *argv[])
{
    uint64_t   sum = 0, fcur = 0;

    int age_filter = 7; // just some

    init_arrays();      // Init P, P1, P2

    for (int64_t p = 0; p < PEOPLE ; p++) {
        if (P[p].age < age_filter) {
            for (int64_t e = 0; e < P[p].cnt ; e++, fcur++)
                sum += __builtin_popcount( P1[fcur] & P2[fcur] );
        } else {
            fcur += P[p].cnt; // skip this person's events
        }
    }

    printf("(dummy %ld %ld)\n", sum, fcur );

    return 0;
}

gcc -O3 -march=native -std=c99 test.c -o test
asked Nov 10 '12 by alecco

2 Answers

Since on average you get 100 items per person, you can speed things up by processing multiple bytes at a time. I rearranged the code slightly to use pointers instead of indexes, and replaced the single inner loop with two loops:

uint8_t *p1 = P1, *p2 = P2;
for (int64_t p = 0; p < PEOPLE ; p++) {
    if (P[p].age < age_filter) {
        int64_t e = P[p].cnt;
        // Process eight events per iteration with one 64-bit AND + popcount
        for ( ; e >= 8 ; e -= 8) {
            sum += __builtin_popcountll( *((long long*)p1) & *((long long*)p2) );
            p1 += 8;
            p2 += 8;
        }
        // Handle the remaining 0-7 events one byte at a time
        for ( ; e ; e--) {
            sum += __builtin_popcount( *p1++ & *p2++ );
        }
    } else {
        p1 += P[p].cnt; // skip this person's events
        p2 += P[p].cnt;
    }
}

In my testing this speeds up your code from 1.515s to 0.855s.
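One portability note: the (long long *) casts make unaligned loads and technically violate strict aliasing; they happen to work on x86. A minimal sketch of a safer variant of the 8-byte loop, assuming the same surrounding variables, goes through memcpy, which compilers fold into a single load:

#include <string.h> // memcpy

for ( ; e >= 8 ; e -= 8) {
    unsigned long long a, b;
    memcpy(&a, p1, 8); // compiles to one plain 8-byte load
    memcpy(&b, p2, 8);
    sum += __builtin_popcountll(a & b);
    p1 += 8;
    p2 += 8;
}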

answered Oct 11 '22 by Sergey Kalinichenko

The answer by Neil doesn't require sorting by age, which, by the way, could be a good idea.

If the second loop has holes (please correct the original source code to support that idea), a common solution is a cumulative sum: cumsum[n+1] = cumsum[n] + __builtin_popcount(P1[n] & P2[n]);
Then for each person: sum += cumsum[fcur + P[p].cnt] - cumsum[fcur]; (a fuller sketch follows below).
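A minimal sketch of that prefix-sum idea, using the question's globals (the uint32_t element type is an assumption; the grand total fits because it is at most 8 * EVENTS, but the array still costs about 400MB at this scale):

static uint32_t cumsum[EVENTS + 1]; // cumsum[n] = popcount total of events 0..n-1

void build_cumsum(void) {
    cumsum[0] = 0;
    for (int64_t n = 0; n < EVENTS; n++)
        cumsum[n + 1] = cumsum[n] + __builtin_popcount(P1[n] & P2[n]);
}

uint64_t filtered_sum(int age_filter) {
    uint64_t sum = 0, fcur = 0;
    for (int64_t p = 0; p < PEOPLE; p++) { // each person is now O(1)
        if (P[p].age < age_filter)
            sum += cumsum[fcur + P[p].cnt] - cumsum[fcur];
        fcur += P[p].cnt;
    }
    return sum;
}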

Anyway, it seems the computational burden is merely of order EVENTS, not EVENTS*PEOPLE. Some optimization can still take place by calling the inner loop once for each run of consecutive people meeting the condition (see the sketch below).
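For instance, a sketch of that run-coalescing idea over the question's globals (an illustration, not from the original post):

uint64_t sum = 0, fcur = 0;
for (int64_t p = 0; p < PEOPLE ; ) {
    if (P[p].age < age_filter) {
        uint64_t start = fcur;
        while (p < PEOPLE && P[p].age < age_filter) // coalesce consecutive matching people
            fcur += P[p++].cnt;
        for (uint64_t e = start; e < fcur; e++)     // one long scan of their events
            sum += __builtin_popcount( P1[e] & P2[e] );
    } else {
        fcur += P[p++].cnt; // skip this person's events
    }
}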

If there are really at most 8 predicates, it could make sense to precalculate the popcount sums for each of the 256 predicate combinations, per person, into separate arrays C[256][PEOPLE]. That just about doubles the memory requirements (on disk?), but localizes the search from 10GB+10GB+...+10GB (8 predicates) to one stream of 200MB (assuming 16-bit entries).
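One possible reading of that idea, as a rough sketch: assume the 8 predicate streams sit in pred[0..7] (a hypothetical layout, with pred[0] = P1, pred[1] = P2, and so on) and precompute, for every subset m of predicates, each person's popcount sum of the AND of the selected streams. The one-time precomputation is expensive, and C alone is 512MB even at the demo's scale:

#define NPRED 8
uint8_t *pred[NPRED];           // hypothetical: pred[0] = P1, pred[1] = P2, ...
static uint16_t C[256][PEOPLE]; // C[m][p]: popcount sum over p's events for subset m

void precompute(void) {
    uint64_t fcur = 0;
    for (int64_t p = 0; p < PEOPLE; p++) {
        for (uint64_t e = fcur; e < fcur + P[p].cnt; e++)
            for (int m = 1; m < 256; m++) {
                uint8_t v = 0xFF; // AND of the streams selected by mask m
                for (int b = 0; b < NPRED; b++)
                    if (m & (1 << b))
                        v &= pred[b][e];
                C[m][p] += (uint16_t)__builtin_popcount(v);
            }
        fcur += P[p].cnt;
    }
}

// The question's P1 & P2 query then becomes a scan of the single row C[0x03]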

Depending on the probability p(P[i].age < condition && P[i].height < cond2), it may no longer make sense to calculate cumulative sums. Maybe, maybe not. More likely, some SSE parallelism, 8 or 16 people at a time, will do.
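SIMD over the event stream is the easier part to sketch: for instance, the 8-byte loop from the accepted answer could be widened to 16 bytes with SSE2 intrinsics (same surrounding variables assumed):

#include <emmintrin.h> // SSE2

for ( ; e >= 16 ; e -= 16) {
    __m128i a = _mm_loadu_si128((const __m128i *)p1);
    __m128i b = _mm_loadu_si128((const __m128i *)p2);
    __m128i v = _mm_and_si128(a, b);
    // popcount the low and high 64-bit halves separately
    sum += __builtin_popcountll((uint64_t)_mm_cvtsi128_si64(v));
    sum += __builtin_popcountll((uint64_t)_mm_cvtsi128_si64(_mm_unpackhi_epi64(v, v)));
    p1 += 16;
    p2 += 16;
}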

answered Oct 11 '22 by Aki Suihkonen