How to approximate the count of distinct values in an array in a single pass through it

Tags: c++, arrays, algorithm, search

I have several huge arrays (millions++ members). All of them are arrays of numbers, and they are not sorted (and I cannot sort them). Some are uint8_t, some uint16_t/32/64. I would like to approximate the count of distinct values in these arrays. The conditions are as follows:

1. Speed is VERY important. I need to do this in one pass through the array, and I must go through it sequentially (I cannot jump back and forth). I am doing this in C++, if that's important.
2. I don't need EXACT counts. For a uint32_t array, for example, I want to know whether there are 10 or 20 distinct numbers, or thousands, or millions.
3. I have quite a bit of memory that I can use, but the less is used the better.
4. The smaller the array data type, the more accurate I need to be.
5. I don't mind STL, but if I can do it without it that would be great (no BOOST though, sorry).
6. If the approach can be easily parallelized, that would be cool (but it's not a mandatory condition).

Examples of perfect output:

ArrayA [uint32_t, 3M members]: ~128 distinct values
ArrayB [uint32_t, 9M members]: 100000+ distinct values
ArrayC [uint8_t, 50K members]: 2-5 distinct values
ArrayD [uint8_t, 700K members]: 64+ distinct values

I understand that some of the constraints may seem illogical, but that's the way it is. As a side note, I also want the top X (3 or 10) most used and least used values, but that is far easier to do and I can do it on my own. However, if someone has thoughts on that too, feel free to share them!

EDIT: a bit of clarification regarding STL. If you have a solution using it, please post it. Not using STL would just be a bonus for us, as we don't fancy it too much. However, if it is a good solution, it will be used!

asked Jan 18 '12 12:01

PeterK

1 Answer

For 8- and 16-bit values, you can simply keep a table with one counter per possible value (256 or 65536 entries). Every time you increment a table entry that was previously zero, you have found a new distinct value.
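The table approach could be sketched roughly as follows. This is only an illustration, not code from the answer: the function name `count_distinct_small` and the use of `std::vector` for the table are my own assumptions, and a plain fixed-size array would work just as well if you want to avoid STL.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <type_traits>
#include <vector>

// Hypothetical helper: exact distinct count for 8- and 16-bit unsigned types,
// using one counter per possible value (256 or 65536 counters).
template <typename T>
std::size_t count_distinct_small(const T* data, std::size_t n) {
    static_assert(std::is_unsigned<T>::value && sizeof(T) <= 2,
                  "the table approach is only sensible for uint8_t / uint16_t");
    assert(data != nullptr || n == 0);

    std::vector<std::uint64_t> counts(std::size_t(1) << (8 * sizeof(T)), 0);
    std::size_t distinct = 0;

    // Single sequential pass over the array.
    for (std::size_t i = 0; i < n; ++i) {
        // An entry that was zero before this increment is a new distinct value.
        if (counts[data[i]]++ == 0) {
            ++distinct;
        }
    }
    return distinct;
}
```

For `uint16_t` the table takes 65536 × 8 bytes = 512 KiB with 64-bit counters. Since the per-value counts are kept, the same table could also be scanned afterwards for the most and least used values.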

For larger values, if you are not interested in counts above 100000, std::map is suitable, provided it is fast enough for you: you can stop as soon as it holds 100000 entries. If it is too slow, you could write your own B-tree.
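A minimal sketch of the `std::map` variant, assuming that the scan may stop as soon as the cap is reached. The names `DistinctResult` and `count_distinct_capped` and the default cap of 100000 are illustrative choices of mine, not part of the answer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>

// Hypothetical result type: `capped` means "at least `distinct` values",
// i.e. the scan stopped early because the cap was reached.
struct DistinctResult {
    std::size_t distinct;
    bool capped;
};

// Hypothetical helper: counts distinct values with std::map, giving up
// once `cap` distinct values have been seen.
template <typename T>
DistinctResult count_distinct_capped(const T* data, std::size_t n,
                                     std::size_t cap = 100000) {
    assert(data != nullptr || n == 0);

    std::map<T, std::uint64_t> counts;  // value -> number of occurrences

    // Single sequential pass; each update costs O(log(size of map)).
    for (std::size_t i = 0; i < n; ++i) {
        ++counts[data[i]];
        if (counts.size() >= cap) {
            return {counts.size(), true};  // report as "cap+"
        }
    }
    return {counts.size(), false};
}
```

Because the map never grows beyond `cap` entries, memory use stays bounded no matter how long the array is. A result with `capped == true` would be printed as "100000+ distinct values", as in the question's ArrayB example.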

answered Oct 05 '22 23:10

TonyK
