I have a large(ish - >100K) collection mapping a user identifier (an int) to the count of different products that they've bought (also an int.) I need to re-organise the data as efficiently as possible to find how many users have different numbers of products. So for example, how many users have 1 product, how many users have two products etc.
I have acheived this by reversing the original data from a std::map
into a std::multimap
(where the key and value are simply reversed.) I can then pick out the number of users having N products using count(N)
(although I also uniquely stored the values in a set so I could be sure of the exact number of values I was iterating over and their order)
Code looks like this:
// uc is a std::map<int, int> containing the original
// mapping of user identifier to the count of different
// products that they've bought.
std::set<int> uniqueCounts;
std::multimap<int, int> cu; // This maps count to user.
for ( map<int, int>::const_iterator it = uc.begin();
it != uc.end(); ++it )
{
cu.insert( std::pair<int, int>( it->second, it->first ) );
uniqueCounts.insert( it->second );
}
// Now write this out
for ( std::set<int>::const_iterator it = uniqueCounts.begin();
it != uniqueCounts.end(); ++it )
{
std::cout << "==> There are "
<< cu.count( *it ) << " users that have bought "
<< *it << " products(s)" << std::endl;
}
I just can't help feeling that this is not the most efficient way of doing this. Anyone know of a clever method of doing this?
I'm limited in that I can't use Boost or C++11 to do this.
Oh, also, in case anyone is wondering, this is neither homework, nor an interview question.
Assuming you know the maximum number of products that a single user could have bought, you might see better performance just using a vector to store the results of the operation. As it is you're going to need an allocation for pretty much every entry in the original map, which likely isn't the fastest option.
It would also cut down on the lookup overhead on a map, gain the benefits of memory locality, and replace the call to count on the multimap (which is not a constant time operation) with a constant time lookup of the vector.
So you could do something like this:
std::vector< int > uniqueCounts( MAX_PRODUCTS_PER_USER );
for ( map<int, int>::const_iterator it = uc.begin();
it != uc.end(); ++it )
{
uniqueCounts[ uc.second ]++;
}
// Now write this out
for ( int i = 0, std::vector< int >::const_iterator it = uniqueCounts.begin();
it != uniqueCounts.end(); ++it, ++i )
{
std::cout << "==> There are "
<< *it << " users that have bought "
<< i << " products(s)" << std::endl;
}
Even if you don't know the maximum number of products, it seems like you could just guess a maximum and adapt this code to increase the size of the vector if required. It's sure to result in less allocations than your original example anyway.
All this is assuming that you don't actually require the user ids after you've processed this data of course (and as pointed out in the comments below, that the number of products bought for each user is a relatively small & contiguous set. Otherwise you might be better off using a map in place of a vector - you'll still avoid calling the multimap::count function, but potentially lose some of the other benefits)
It depends on what you mean by "more efficient". First off, is this really a bottle neck? Sure, 100k entries is a lot, but if you only have to this every few minutes, it's ok if the algorithm takes a couple seconds.
The only area for improvement I see is memory usage. If this is a concern, you can skip the generation of the multimap and just keep a counter map around, something like this (beware, my C++ is a little rusty):
std::map<int, int> countFrequency; // count => how many customers with that count
for ( std::map<int, int>::const_iterator it = uc.begin();
it != uc.end(); ++it )
{
// If it->second is not yet in countFrequency,
// the default constructor initializes it to 0.
countFrequency[it->second] += 1;
}
// Now write this out
for ( std::map<int, int>::const_iterator it = countFrequency.begin();
it != countFrequency.end(); ++it )
{
std::cout << "==> There are "
<< it->second << " users that have bought "
<< it->first << " products(s)" << std::endl;
}
If a user is added and buys count
items, you can update countFrequency
with
countFrequency[count] += 1;
If an existing user goes from oldCount
to newCount
items, you can update countFrequency
with
countFrequency[oldCount] -= 1;
countFrequency[newCount] += 1;
Now, just as an aside, I recommend using an unsigned int
for count (unless there's a legitimate reason for negative counts) and typedef'ing a userID
type, for added readability.
If you can, I would recommend keeping both pieces of data current all the time. In other words, I would maintain a second map which is mapping number of products bought to number of customers who bought that many products. This map contains the exact answer to your question if you maintain it. Each time a customer buys a product, let n be the number of products this customer has now bought. Subtract one from the value at key n-1. Add one to the value at key n. If the range of keys is small enough this could be an array instead of a map. Do you ever expect a single customer to buy hundreds of products?
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With