I'd like to look up 3 integers (i.e. [1 2 3]) in a large data set of around a million points.
I'm currently using MATLAB's Map (hashmap), and for each point I'm doing the following:
key = sprintf('%d ', [1 2 3]); % 23 us
% key = '1 2 3 '
result = lookup_map( key ); % 32 us
This is quite time consuming though - 1 million points * 55 us = 55 seconds.
I'd like to move this to the GPU using CUDA, but I'm not sure of the best way of approaching this.
I could transfer four arrays - key1, key2, key3, result
, and then perform binary search on the keys, but this would take 20 iterations (2^20 = 1048576) per key. Then I'd also have delays due to concurrent memory access from each thread.
Is there a data structure optimised for parallel (O(1), ideally) multiple key lookups in CUDA?
Q: What are the bounds of the three integers? And what data is looked up?
The integer keys can be between 0 and ~75,000 currently, but may be bigger (200,000+) in the future.
For the purposes of this question, we can assume that the result
is an integer between 0 and the size of the data set.
Q: Why don't you pack all three numbers into one 64bit number (21 bits per number gives you a range of 0-2,097,152). And use that to index into a sparse array?
>> A = uint64(ones(10));
>> sparse_A = sparse(A)
??? Undefined function or method 'sparse' for input arguments of type 'uint64'.
>> A = int64(ones(10));
>> sparse_A = sparse(A)
??? Undefined function or method 'sparse' for input arguments of type 'int64'.
It appears that my matlab doesn't support sparse arrays of 64-bit numbers.
In case this helps anyone else, I wrote a quick function to create a 64-bit key from three <2^21 unsigned integers:
function [key] = to_key(face)
key = uint64(bitsll(face(1), 42) + bitsll(face(2), 21) + rand(face(3),1));
end
Q: From @Dennis - why not use logical indexing?
Let's test it!
% Generate a million random integers between 0 and 1000
>> M = int32(floor(rand(10000000,4)*1000));
% Find a point to look for
>> search = M(500000,1:3)
search =
850 910 581
>> tic; idx = M(:,1)==search(1) & M(:,2)==search(2)&M(:,3)==search(3); toc;
Elapsed time is 0.089801 seconds.
>> M(idx,:)
ans =
850 910 581 726
Unfortunately this takes 89801us, which is 1632x slower than my existing solution (55us)! It would take 2.5 hours to run this a million times!
We could try filtering M
after each search:
>> tic; idx1=M(:,1)==search(1); N=M(idx1,:); idx2=N(:,2)==search(2); N2=N(idx2,:); idx3 = N2(:,3)==search(3); toc;
Elapsed time is 0.038272 seconds.
This is a little faster, but still 696x slower than using Map.
I've been thinking about this some more, and I've decided to profile the speed of re-generating some of the data on the fly from a single key lookup - it might be faster than a 3 key lookup, given the potential problems with this approach.
I'm guessing this question is related to your previous question about tetrahedron faces. I still suggest you should try the sparse
storage and sparse matrix-vector multiplication for that purpose:
size(spA)
ans =
1244810 1244810
tic;
vv = spA*v;
idx = find(vv);
toc;
Elapsed time is 0.106581 seconds.
This is just timing analysis, see my previous answer about how to implement it in your case. Before you move to CUDA and do complicated stuff, check out simpler options.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With