Effective unique on unordered elements

I'd like to improve the speed performance of my box counting method I use in fractal analysis.

About the task

I have a stream of ints (about n=2^24 long) and I have to calculate how many different values are in the stream. There's no upper boundary and negative values are allowed (but the amount of negative values is possibly less than sqrt(n)). There's a small correlation in the stream, i.e. the actual element is likely to be equal or not too far from the previous one. In many cases I have a lot of equal values in the whole range.

Methods I've already tried

vector, sort, unique

My first implementation was to put all the elements into a vector, then apply std::sort followed by std::unique.

The complexity of this method is O(n*log(n)), and I don't think any other algorithm can scale better in general. But I'm sure code must exist that is faster than this while having the same scaling properties, so faster only by a constant factor. The reasons are:

  1. I have a lot of equal values stored in the vector, so sorting is not as effective and the vector is unduly large
  2. This method doesn't use the information that the actual element is likely to be close to the previous one
  3. I don't need to know what the unique values are, only how many different elements there are
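A minimal sketch of this first approach (the function name is mine, for illustration):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Count distinct values by sorting and deduplicating: O(n log n).
// The vector is taken by value because std::sort reorders it.
std::size_t count_distinct_sort(std::vector<int> v)
{
    std::sort(v.begin(), v.end());
    // std::unique moves one copy of each run to the front and returns
    // the end of the deduplicated range.
    return static_cast<std::size_t>(std::unique(v.begin(), v.end()) - v.begin());
}
```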

set, insert, size

To eliminate the first ineffectiveness point, I put every element into a set with set::insert. At the end I counted the number of elements with set::size.

My expectation was that this code must be faster because only unique values are stored in the set, so it doesn't have to compare new elements against a lot of equal values. But unfortunately this method was 1.5 times slower than the previous one.
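The same idea as a sketch (function name mine); duplicates are rejected by the set itself:

```cpp
#include <cstddef>
#include <set>
#include <vector>

// Count distinct values by inserting everything into an ordered set.
// Each insert is O(log k) where k is the number of distinct values so far.
std::size_t count_distinct_set(const std::vector<int>& stream)
{
    std::set<int> seen;
    for (int x : stream)
        seen.insert(x);   // duplicate values are silently ignored
    return seen.size();
}
```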

set, emplace_hint, size

To eliminate the second ineffectiveness point as well, I put every element into the set with set::emplace_hint, each time giving a hint to place the new element next to the previous one. At the end, I again asked for the size of the set with set::size.

My expectation was that this code must be faster than the previous one because I can guess the value of the new element, and a guess is better than nothing. But unfortunately this method was 5 times slower than the previous one.
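A sketch of the hinted variant (function name mine). Note that a hint only helps when the new element actually belongs right at the hinted position; when the guess is wrong, the set still has to search, so the hint bookkeeping is pure overhead:

```cpp
#include <cstddef>
#include <set>
#include <vector>

// Count distinct values, hinting each insertion with the position of
// the previously inserted/found element, assuming the stream is correlated.
std::size_t count_distinct_hinted(const std::vector<int>& stream)
{
    std::set<int> seen;
    auto hint = seen.begin();
    for (int x : stream)
        // emplace_hint returns an iterator to the inserted element,
        // or to the already-existing equal element.
        hint = seen.emplace_hint(hint, x);
    return seen.size();
}
```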

The question

Can you suggest any effective method that calculates the number of different elements (ints) in a stream? Can you optimize the code given that

  1. there's a measurable correlation in the numbers
  2. some numbers appear repeatedly

The target architecture is a modern x86 or x86-64 PC processor (with SSE, SSE2), and only single-threaded code is suitable. I prefer not to use Boost, but C++11 is fine.

The solution

First, thanks for the many suggestions, patience and understanding. I'm sorry that I cannot test all of the methods, and I'm also sure that effectiveness depends on the details of the stream of ints, which I've not provided. However, I share the results I've got with the VS2013 compiler. (The code was also tested under gcc 4.7 but not measured.) This topic is worth a lot more time to investigate, but I've got a solution that fits my needs.

(Image: time statistics for the different methods.)

About the methods:

  • vector of bool: the BitVector solution from Dieter Lücking
  • binary lookup: the method suggested by Tony D
  • unordered set: putting simply all elements into an std::unordered_set, then ask for the number of its elements, as suggested by Ixanezis
  • vector insert sorted: using Dieter Lücking's Sorted Vector approach
  • set insert: the method I've described in the question form
  • radix sort: Ixanezis's suggestion, using a popular sorting algorithm on a vector
  • set emplace hint: using std::emplace_hint as described in the question form
asked Apr 18 '14 by DanielTuzes


1 Answer

Since you only deal with a limited range of integer numbers, the radix sort algorithm can be employed effectively here, eliminating the log(n) factor from the complexity. You can pick a really fast implementation somewhere on the internet. Some of them require SSE support, others are multithreaded or even written to run on a GPU.
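As an illustration of the radix-sort idea (this is my own sketch, not a tuned implementation): an LSD radix sort on 8-bit digits, with the sign bit flipped so signed ints order correctly as unsigned keys, followed by one linear pass counting runs:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Count distinct ints via LSD radix sort (4 passes of 8-bit digits),
// then a single scan over the sorted keys. Overall O(n).
std::size_t count_distinct_radix(std::vector<int> v)
{
    if (v.empty()) return 0;

    std::vector<uint32_t> keys(v.size()), tmp(v.size());
    for (std::size_t i = 0; i < v.size(); ++i)
        // Flipping the sign bit maps INT_MIN..INT_MAX to 0..UINT32_MAX
        // monotonically, so unsigned order equals signed order.
        keys[i] = static_cast<uint32_t>(v[i]) ^ 0x80000000u;

    for (int shift = 0; shift < 32; shift += 8) {
        std::size_t count[257] = {};
        for (uint32_t k : keys)
            ++count[((k >> shift) & 0xFFu) + 1];
        for (int d = 0; d < 256; ++d)        // prefix sums -> bucket offsets
            count[d + 1] += count[d];
        for (uint32_t k : keys)              // stable scatter by current digit
            tmp[count[(k >> shift) & 0xFFu]++] = k;
        keys.swap(tmp);
    }

    std::size_t distinct = 1;
    for (std::size_t i = 1; i < keys.size(); ++i)
        if (keys[i] != keys[i - 1]) ++distinct;
    return distinct;
}
```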

If you have the possibility to use boost::unordered_set or C++11 std::unordered_set, then your second approach can easily be modified to use it, also resulting in a linear-complexity algorithm. However, if you have at least a few million numbers in the stream, I believe the first method would be faster.
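A sketch of the hash-set variant (names mine), with one extra tweak that exploits the correlation mentioned in the question: runs of equal consecutive values are skipped with a plain comparison before the hash table is touched at all:

```cpp
#include <cstddef>
#include <unordered_set>
#include <vector>

// Count distinct ints with a hash set: amortized O(1) per insertion.
std::size_t count_distinct_hashed(const std::vector<int>& stream)
{
    std::unordered_set<int> seen;
    seen.reserve(stream.size());   // avoid rehashing during the fill
    for (std::size_t i = 0; i < stream.size(); ++i) {
        // Cheap filter: a value equal to its predecessor is already counted.
        if (i > 0 && stream[i] == stream[i - 1])
            continue;
        seen.insert(stream[i]);
    }
    return seen.size();
}
```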

answered Oct 13 '22 by Ixanezis