Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

3 integer key lookup in CUDA

I'd like to look up 3 integers (i.e. [1 2 3]) in a large data set of around a million points.

I'm currently using MATLAB's Map (hashmap), and for each point I'm doing the following:

key = sprintf('%d ', [1 2 3]);       % 23 us
% key = '1 2 3 '
result = lookup_map( key );          % 32 us

This is quite time consuming though - 1 million points * 55 us = 55 seconds.

I'd like to move this to the GPU using CUDA, but I'm not sure of the best way of approaching this.

I could transfer four arrays - key1, key2, key3, result, and then perform binary search on the keys, but this would take 20 iterations (2^20 = 1048576) per key. Then I'd also have delays due to concurrent memory access from each thread.

Is there a data structure optimised for parallel (O(1), ideally) multiple key lookups in CUDA?


Q: What are the bounds of the three integers? And what data is looked up?

The integer keys can be between 0 and ~75,000 currently, but may be bigger (200,000+) in the future.

For the purposes of this question, we can assume that the result is an integer between 0 and the size of the data set.


Q: Why don't you pack all three numbers into one 64bit number (21 bits per number gives you a range of 0-2,097,152). And use that to index into a sparse array?

>> A = uint64(ones(10));
>> sparse_A = sparse(A)
??? Undefined function or method 'sparse' for input arguments of type 'uint64'.

>> A = int64(ones(10));
>> sparse_A = sparse(A)
??? Undefined function or method 'sparse' for input arguments of type 'int64'.

It appears that my matlab doesn't support sparse arrays of 64-bit numbers.

In case this helps anyone else, I wrote a quick function to create a 64-bit key from three <2^21 unsigned integers:

function [key] = to_key(face)
    key = uint64(bitsll(face(1), 42) + bitsll(face(2), 21) + rand(face(3),1));
end

Q: From @Dennis - why not use logical indexing?

Let's test it!

% Generate a million random integers between 0 and 1000
>> M = int32(floor(rand(10000000,4)*1000));
% Find a point to look for
>> search =  M(500000,1:3)             
search =
         850         910         581  
>> tic; idx = M(:,1)==search(1) & M(:,2)==search(2)&M(:,3)==search(3); toc;
Elapsed time is 0.089801 seconds.
>> M(idx,:)
ans =
         850         910         581         726

Unfortunately this takes 89801us, which is 1632x slower than my existing solution (55us)! It would take 2.5 hours to run this a million times!

We could try filtering M after each search:

>> tic; idx1=M(:,1)==search(1); N=M(idx1,:); idx2=N(:,2)==search(2); N2=N(idx2,:); idx3 = N2(:,3)==search(3); toc;
Elapsed time is 0.038272 seconds.

This is a little faster, but still 696x slower than using Map.


I've been thinking about this some more, and I've decided to profile the speed of re-generating some of the data on the fly from a single key lookup - it might be faster than a 3 key lookup, given the potential problems with this approach.

like image 438
Alex L Avatar asked Oct 15 '12 07:10

Alex L


1 Answers

I'm guessing this question is related to your previous question about tetrahedron faces. I still suggest you should try the sparse storage and sparse matrix-vector multiplication for that purpose:

size(spA)
ans =

 1244810     1244810

tic;
vv = spA*v;
idx = find(vv);
toc;

Elapsed time is 0.106581 seconds.

This is just timing analysis, see my previous answer about how to implement it in your case. Before you move to CUDA and do complicated stuff, check out simpler options.

like image 91
angainor Avatar answered Oct 19 '22 09:10

angainor