Faster fundamental datastructures on multicore machines?

I've been pondering this question for a while:

Can you build a faster fundamental datastructure (i.e. linked list, hashtable, set, skiplist, bloom filter, red-black tree, etc.) on a multicore machine, by taking advantage of the fact that you've got more than one CPU?

I did some preliminary experimenting with pthreads, and found that pthread_create() takes on the order of 30us, but a simple hash_map insert takes far less time than that on a single core. And thus, it's become hard for me to imagine creating a faster hash_map<>, since synchronization primitives and thread creation are so slow. I can also imagine tree traversal and balancing in parallel, but again, synchronization primitives would seem to make runtimes longer, not shorter.
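
For what it's worth, a micro-benchmark along those lines might look like the sketch below (using std::thread and std::unordered_map as stand-ins for pthread_create() and hash_map<>; exact numbers will vary by machine and compiler):

// A minimal sketch of the measurement described above. Numbers are
// machine-dependent; the point is only the rough order of magnitude.
#include <chrono>
#include <cstdio>
#include <thread>
#include <unordered_map>

int main() {
  using clock = std::chrono::steady_clock;

  // Cost of spawning and joining one thread.
  auto t0 = clock::now();
  std::thread t([] {});
  t.join();
  auto t1 = clock::now();

  // Cost of one hash table insert, averaged over many.
  std::unordered_map<int, int> m;
  const int kInserts = 1000000;
  auto t2 = clock::now();
  for (int i = 0; i < kInserts; i++) m[i] = i;
  auto t3 = clock::now();

  auto us = [](auto d) {
    return std::chrono::duration_cast<std::chrono::microseconds>(d).count();
  };
  std::printf("thread create+join: %lld us\n", (long long)us(t1 - t0));
  std::printf("avg insert: %.3f us\n", (double)us(t3 - t2) / kInserts);
}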

It still feels intuitive to me that "I've got more CPU, and thus, I should be able to do it faster", but I can't quite wrap my head around a proof or counter-proof for that statement. I've experimented quite a bit in C++, but am now suspecting that other languages may offer better solutions (Erlang?) for this task. Thoughts?

EDIT details: I think there are several programming / datastructure paradigms that are frequently used that could possibly be sped up. For example, I find myself frequently writing code that basically looks like this (where real data has been replaced with "rand()")

static const int N = 1000000; 
static const int M = 10000000; // 10x more lookups 
hash_map<int, int> m; 
// batch insert a bunch of interesting data 
for (int i = 0; i < N; i++) m[rand()] = rand(); 

// Do some random access lookups. 
for (int i = 0; i < M; i++) m[rand()]++;

This kind of paradigm is frequently used for things like name-value settings & configuration data, batch processing, etc. The 10x (or more) lookup/insert ratio is what makes a traditional hash_map<> ideal for this sort of operation.

This could easily be split in half, with an insert phase and a lookup phase, and in a parallel world, there may be some "flush queue" operation between the two halves. More difficult is the interleaved insert + lookup version:

hash_map<int, int> m; 

for (int i = 0; i < N; i++) { 
   if (rand() % LOOKUP_RATIO == 0) 
     m[rand()]++;  // "lookup" 
   else 
     m[rand()] = rand();  // "insert" 
}

In that scenario, the inserts could be asynchronous as long as the insert queue were flushed before each lookup, and if LOOKUP_RATIO is large enough (say, >1000) then it becomes quite similar to the batch example above, but with some queueing. Of course, the queueing itself implies synchronization primitives.
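
To make the flush-before-lookup discipline concrete, here is a minimal single-threaded sketch. QueuedMap is a hypothetical wrapper, not a standard container; a parallel version would have another thread draining the pending queue, and then the queue itself would need a mutex or a lock-free structure, which is exactly the synchronization cost in question.

// Sketch: inserts are buffered; a lookup forces a flush first.
#include <unordered_map>
#include <utility>
#include <vector>

class QueuedMap {
 public:
  void insert(int key, int value) {   // cheap: just enqueue
    pending_.emplace_back(key, value);
  }
  int lookup(int key) {               // flush pending inserts, then read
    flush();
    auto it = map_.find(key);
    return it == map_.end() ? 0 : it->second;
  }
 private:
  void flush() {
    for (auto& kv : pending_) map_[kv.first] = kv.second;
    pending_.clear();
  }
  std::unordered_map<int, int> map_;
  std::vector<std::pair<int, int>> pending_;
};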

Imagine for a second, the following snippet:

hash_map<int,int> a;
hash_map<int,int> b; 
for (int i = 0; i < N; i++) { 
  // the following 2 lines could be executed in parallel 
  a[rand()] = rand(); 
  b[rand()] = rand(); 
}

And thus, a lookup might be done in "parallel" by:

int lookup(int value) { 
  // The following 2 lines could be executed in parallel: 
  int v1 = a[value]; 
  int v2 = b[value]; 
  if (v1)  // pseudo code for "value existed in a" 
    return v1; 
  else 
    return v2; 
}
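
As a concrete version of that pseudocode, the sketch below uses std::async to query the second map on another thread. It assumes no concurrent inserts, and the cost of launching a task would almost certainly dwarf a single hash lookup (the same objection as with pthread_create() above), so it could only pay off for large batches of operations.

// Sketch: two-map parallel lookup via std::async. Assumes a and b are
// not being modified concurrently, and that 0 means "not present".
#include <future>
#include <unordered_map>

std::unordered_map<int, int> a, b;

int lookup(int key) {
  // Query b on another thread while this thread queries a.
  auto fb = std::async(std::launch::async, [key] {
    auto it = b.find(key);
    return it == b.end() ? 0 : it->second;
  });
  auto ita = a.find(key);
  int v1 = (ita == a.end()) ? 0 : ita->second;
  int v2 = fb.get();
  return v1 ? v1 : v2;
}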
asked Feb 24 '09 by slacy
1 Answer

The problem is that shared data is itself the bane of parallel computing. Ideally you want each core to be working on separate data, otherwise there will be overhead associated with synchronization. (How to communicate without shared state? By message passing.)

Also, it's a bit strange to talk about data structures being sped up. I find it more natural to talk about operations on the data structure being sped up, since different operations on different data structures have different characteristics. Is there a particular type of access that you want to speed up?

EDIT, in response to the extra details: I assume that the goal is to have a single hash map which can be accessed in parallel, whose underpinnings could be multiple hash tables, but which would be transparently presented to the user of this data structure as a single hash table. Naturally, we would be concerned about spending too much time spinning on locks. Also, at this level, we have to be aware of cache coherence problems. That is, if cores or processors have separate caches pointing to the same data, and one modifies the data, then the cached copy on the other is invalidated. If this happens repeatedly, it can impose huge costs, and parallelism could be worse than having things on a single core. So I'm very wary of shared data.
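
To illustrate the cache-line effect (a sketch, assuming 64-byte cache lines): two fields updated by different cores will ping-pong the shared line between caches unless they are padded apart.

struct Shared {          // both fields likely share one 64-byte line:
  long counter_a;        // a write to counter_b by another core
  long counter_b;        // invalidates this core's cached counter_a
};

struct Padded {          // each field gets its own cache line
  alignas(64) long counter_a;
  alignas(64) long counter_b;
};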

My instinct would be to have a pool of threads, each of which owns a different section of the hash table. The hash function would first map from key to hash table section, and then to an offset within that section. An update would be passed as a message to the thread which owns that section of the hash table. That way, no one is trying to modify the same thing at once. Naturally, this is easier in languages (Erlang) which have features for asynchronous message-passing concurrency than in others.
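
A rough C++ sketch of that design follows. ShardedMap is hypothetical, not an existing library; each shard's table is touched only by its owning worker thread, so the table itself needs no lock, and only the message queue is synchronized (here with a mutex and condition variable for simplicity). Lookups would likewise be messages answered through a reply channel or future, omitted for brevity.

// Sketch: key -> shard -> message to the shard's owning thread.
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <unordered_map>
#include <vector>

class ShardedMap {
 public:
  explicit ShardedMap(size_t num_shards) : shards_(num_shards) {
    for (auto& s : shards_)
      s.worker = std::thread([&s] { s.run(); });
  }
  ~ShardedMap() {
    for (auto& s : shards_) s.post({/*stop=*/true, 0, 0});
    for (auto& s : shards_) s.worker.join();
  }
  // Fire-and-forget update: routed to the owning shard's inbox.
  void insert(int key, int value) {
    shards_[shard_of(key)].post({false, key, value});
  }

 private:
  struct Msg { bool stop; int key; int value; };
  struct Shard {
    std::unordered_map<int, int> table;  // touched only by `worker`
    std::mutex mu;
    std::condition_variable cv;
    std::queue<Msg> inbox;
    std::thread worker;

    void post(Msg m) {
      { std::lock_guard<std::mutex> lk(mu); inbox.push(m); }
      cv.notify_one();
    }
    void run() {
      for (;;) {
        std::unique_lock<std::mutex> lk(mu);
        cv.wait(lk, [this] { return !inbox.empty(); });
        Msg m = inbox.front(); inbox.pop();
        lk.unlock();
        if (m.stop) return;
        table[m.key] = m.value;          // no lock: this thread owns table
      }
    }
  };
  size_t shard_of(int key) const {
    return std::hash<int>()(key) % shards_.size();
  }
  std::vector<Shard> shards_;
};

In Erlang, each shard would simply be a process with a mailbox, and the queue synchronization comes for free from the runtime.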

answered Nov 16 '22 by Rob Lachlan