How to approximate the count of distinct values in an array in a single pass through it

I have several huge arrays (millions of members or more). They are all arrays of numbers, and they are unsorted (and I cannot sort them). Some are uint8_t, others uint16_t, uint32_t, or uint64_t. I would like to approximate the count of distinct values in these arrays. The conditions are as follows:

  1. Speed is VERY important: I need to do this in one pass through the array, and I must go through it sequentially (no jumping back and forth). I am doing this in C++, if that's important.
  2. I don't need EXACT counts. For a uint32_t array, what I want to know is whether there are around 10 or 20 distinct numbers, or thousands, or millions.
  3. I have quite a bit of memory available, but the less I use the better.
  4. The smaller the array data type, the more accurate I need to be.
  5. I don't mind STL, but if I can do it without it, that would be great (no Boost though, sorry).
  6. If the approach can be easily parallelized, that would be cool (but it's not a mandatory condition).

Examples of perfect output:

ArrayA [uint32_t, 3M members]: ~128 distinct values
ArrayB [uint32_t, 9M members]: 100000+ distinct values
ArrayC [uint8_t, 50K members]: 2-5 distinct values
ArrayD [uint8_t, 700K members]: 64+ distinct values

I understand that some of the constraints may seem illogical, but that's the way it is. As a side note, I also want the top X (3 or 10) most used and least used values, but that is far easier to do and I can do it on my own. However, if someone has thoughts on that too, feel free to share them!

EDIT: A bit of clarification regarding STL. If you have a solution using it, please post it. Not using STL would just be a bonus for us; we don't fancy it too much. However, if it is a good solution, it will be used!

asked Jan 18 '12 12:01 by PeterK


1 Answer

For 8- and 16-bit values, you can just keep a table with a count for each possible value (256 or 65,536 entries); every time you increment a table entry that was previously zero, you have found a new distinct value.
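The table idea for small types could look roughly like the following. This is only a sketch: the function name `count_distinct_small` and the choice of `std::vector` with 64-bit counters are assumptions made for illustration, and a plain array would work just as well if you want to avoid STL.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <type_traits>
#include <vector>

// Hypothetical helper (name is illustrative): exact distinct count for
// 8- or 16-bit unsigned values, using one counter per possible value.
template <typename T>
std::size_t count_distinct_small(const T* data, std::size_t n) {
    static_assert(std::is_unsigned<T>::value && sizeof(T) <= 2,
                  "table approach is only practical for uint8_t / uint16_t");
    // 256 entries for uint8_t, 65536 entries for uint16_t, all zero.
    std::vector<std::uint64_t> counts(
        static_cast<std::size_t>(std::numeric_limits<T>::max()) + 1, 0);
    std::size_t distinct = 0;
    for (std::size_t i = 0; i < n; ++i) {
        // An entry that is still zero means this value was not seen before.
        if (counts[data[i]]++ == 0) {
            ++distinct;
        }
    }
    return distinct;
}
```

It is a single sequential pass, and the result is exact rather than approximate for these types. If the function were changed to hand back the filled table, the same counters could also serve for the most used and least used values mentioned in the question.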

For larger values, if you are not interested in counts above 100000, std::map is suitable, provided it's fast enough. If that's too slow for you, you could write your own B-tree.
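A minimal sketch of the std::map variant with a cap, assuming you only care whether the count exceeds some threshold. The names `DistinctResult` and `count_distinct_capped`, and the early exit once the cap is exceeded, are assumptions made for illustration and are not spelled out in the answer above.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>

// Hypothetical result type: `distinct` is exact unless `saturated` is true,
// in which case the array holds more than `distinct` (== cap) distinct values.
struct DistinctResult {
    std::size_t distinct;
    bool saturated;
};

// Hypothetical helper: counts distinct values with std::map, giving up as
// soon as more than `cap` distinct values have been seen.
template <typename T>
DistinctResult count_distinct_capped(const T* data, std::size_t n,
                                     std::size_t cap = 100000) {
    std::map<T, std::uint64_t> counts;
    for (std::size_t i = 0; i < n; ++i) {
        auto it = counts.find(data[i]);
        if (it != counts.end()) {
            ++it->second;               // value already known: bump its count
        } else if (counts.size() < cap) {
            counts.emplace(data[i], 1); // new value, still under the cap
        } else {
            return {cap, true};         // one value too many: report "cap+"
        }
    }
    return {counts.size(), false};
}
```

A saturated result corresponds to the "100000+ distinct values" style of output from the question. Note that stopping early leaves the per-value counts incomplete, so if you also need the most used values you would have to keep scanning instead of returning.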

answered Oct 05 '22 23:10 by TonyK