I have a function that reads ~10,000 words into a vector; I then want to group all the words into a map to 'count' how many times a certain word appears.
While the code 'works', it can sometimes take 2 seconds to re-build the map.
NB: Unfortunately, I cannot change the 'read' function; I have to work with the vector of std::u16string.
std::vector<std::u16string> vValues;
vValues.push_back( ... )
...
std::map<std::u16string, int> mValues;
for( auto it = vValues.begin(); it != vValues.end(); ++it )
{
    if( mValues.find( *it ) == mValues.end() )
    {
        mValues[*it] = 1;
    }
    else
    {
        ++mValues[*it];
    }
}
How could I speed up the 'group by' while keeping track of the number of times the word appears in the vector?
If you call std::map::operator[] on a key that is not yet in the map, the mapped value is value-initialized (to 0 for PODs like int). So your loop can be simplified to:
for (auto it = vValues.begin(); it != vValues.end(); ++it)
    ++mValues[*it];
If there is no key *it, the default value will be 0, but it is then incremented immediately and becomes 1. If the key already exists, it is simply incremented.
Furthermore, it doesn't look like you need the map to be ordered, so you can use a std::unordered_map instead; lookup and insertion are constant time on average instead of logarithmic, which should speed things up even further.
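A minimal sketch of that variant, assuming the same vValues vector from the question; countWords is just an illustrative wrapper name, and the reserve() call is an optional extra that avoids some rehashing:

#include <string>
#include <unordered_map>
#include <vector>

std::unordered_map<std::u16string, int> countWords(const std::vector<std::u16string>& vValues)
{
    std::unordered_map<std::u16string, int> mValues;
    mValues.reserve(vValues.size());   // upper bound on distinct words; avoids rehashing
    for (const auto& word : vValues)
        ++mValues[word];               // value-initialized to 0 on first insertion
    return mValues;
}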
std::vector<std::u16string> vValues;
vValues.push_back( ... )
...
std::sort( vValues.begin(), vValues.end() );

struct counted {
    std::u16string value;
    std::size_t count;
};

std::vector<counted> result;
auto it = vValues.begin();
while (it != vValues.end()) {
    // After the sort, all copies of *it are adjacent; count the whole run at once.
    auto r = std::equal_range( it, vValues.end(), *it );
    result.push_back({ *it, static_cast<std::size_t>(r.second - r.first) });
    it = r.second;  // skip past this group
}
After this is done, result will contain a {value, count} entry for each distinct value and will be sorted. As all the work was done in contiguous containers, it should be faster than your implementation.
If you aren't allowed to mutate vValues, one thing you could do is create a vector of gsl::span<char16_t> from it, sort that, then build the result vector similarly. (If you don't have gsl::span, write one; they aren't hard to write.) Failing that, even copying vValues once may be faster than your original solution.
Using a gsl::span<char16_t const> in counted would save some allocations as well (reusing the storage within vValues, at the cost of tying their lifetimes together).
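A sketch of that non-mutating variant, using std::u16string_view (C++17) as a stand-in for gsl::span; counted_view and countSorted are illustrative names, and the views stay valid only as long as vValues does:

#include <algorithm>
#include <string>
#include <string_view>
#include <vector>

struct counted_view {
    std::u16string_view value;  // points into vValues; must not outlive it
    std::size_t count;
};

std::vector<counted_view> countSorted(const std::vector<std::u16string>& vValues)
{
    // Sort cheap views instead of mutating (or copying) the strings themselves.
    std::vector<std::u16string_view> views(vValues.begin(), vValues.end());
    std::sort(views.begin(), views.end());

    std::vector<counted_view> result;
    auto it = views.begin();
    while (it != views.end()) {
        auto r = std::equal_range(it, views.end(), *it);
        result.push_back({ *it, static_cast<std::size_t>(r.second - r.first) });
        it = r.second;
    }
    return result;
}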
One serious concern is that if your strings are extremely long, determining that two strings are equal is expensive, and if they have common prefixes, determining that they are different can also be expensive. We do log(n) comparisons per distinct element in the equal_range code, and n log(n) in the sort; sometimes sorting (hash of string, string) pairs can be faster than sorting the strings alone, as it makes unlike strings easy to tell apart.
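A sketch of that (hash, string) idea, assuming std::hash<std::u16string> as the hash and reusing the counted struct from above; keyed and hasher are illustrative names only:

#include <algorithm>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Pair each word with its precomputed hash so the sort can usually tell
// unlike strings apart with a single integer comparison.
std::vector<std::pair<std::size_t, std::u16string>> keyed;
keyed.reserve(vValues.size());
std::hash<std::u16string> hasher;
for (const auto& word : vValues)
    keyed.push_back({ hasher(word), word });

// std::pair compares the hash first and falls back to the string on ties,
// so equal strings still end up adjacent.
std::sort(keyed.begin(), keyed.end());

std::vector<counted> result;
auto it = keyed.begin();
while (it != keyed.end()) {
    auto r = std::equal_range(it, keyed.end(), *it);
    result.push_back({ it->second, static_cast<std::size_t>(r.second - r.first) });
    it = r.second;
}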
Live example with 4 different versions; simply change test1 to test2, test3 or test4.
test3 was faster than all the other versions in every test I did:
std::unordered_map<std::string, int> test3(std::vector<std::string> vValues)
{
    std::unordered_map<std::string, int> mValues;
    for( auto it = vValues.begin(); it != vValues.end(); ++it )
    {
        // The vector is taken by value, so its strings can be moved into the map.
        ++mValues[std::move(*it)];
    }
    return mValues;
}