I have several huge arrays (millions++ members). All those are arrays of numbers and they are not sorted (and I cannot do that). Some are uint8_t
, some uint16_t/32/64
. I would like to approximate the count of distinct values in these arrays. The conditions are following:
Examples of perfect output:
ArrayA [uint32_t, 3M members]: ~128 distinct values
ArrayB [uint32_t, 9M members]: 100000+ distinct values
ArrayC [uint8_t, 50K members]: 2-5 distinct values
ArrayD [uint8_t, 700K members]: 64+ distinct values
I understand that some of the constraints may seem illogical, but thats the way it is. As a side note, I also want the top X (3 or 10) most used and least used values, but that is far easier to do and I can do it on my own. However if someone has thoughts for that too, feel free to share them!
EDIT: a bit of clarification regarding STL. If you have a solution using it, please post it. Not using STL would be just a bonus for us, we dont fancy it too much. However if it is a good solution, it will be used!
Using sort function() Calculate the length of an array using the length() function that will return an integer value as per the elements in an array. Call the sort function and pass the array and the size of an array as a parameter. Take a temporary variable that will store the count of distinct elements.
Each element e of the data stream is uniformly and independently hashed to an index in the bit vector, and the corresponding bit is set to 1. When a query is made, the number of distinct elements is estimated as m ln (n/m) where m is the number of bits in B that are still 0.
1) sort the vector using quick sort or merge sort, and then iterate over the sorted vector, counting up each time you encounter a value different from current value. 2) set up a std::vector<bool> of size 1,000,000 and put in true values as you iterate over your array. afterwards you count the number of true values.
For 8- and 16-bit values, you can just make a table of the count of each value; every time you write to a table entry that was previously zero, that's a different value found.
For larger values, if you are not interested in counts above 100000, std::map
is suitable, if it's fast enough. If that's too slow for you, you could program your own B-tree.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With