Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Fast 'group by/count' std::vector<std::u16string> into a std::map<u16string, int>

Tags:

c++

c++11

vector

I have a function that reads ~10000 words into a vector, I then want to group all the words into a map to 'count' how many times a certain word appears.

While the code 'works' it can sometimes take 2 seconds to re-build the map.

NB: Unfortunately, I cannot change the 'read' function, I have to work with the vector of std::u16string.

std::vector<std::u16string> vValues;
vValues.push_back( ... )
...

std::map<std::u16string, int> mValues;
for( auto it = vValues.begin(); it != vValues.end(); ++it )
{
  if( mValues.find( *it ) == mValues.end() )
  {
    mValues[*it] = 1;
  }
  else
  {
    ++mValues[*it];
  }
}

How could I speed up the 'group by' while keeping track of the number of times the word appears in the vector?

like image 267
Simon Goodman Avatar asked Sep 15 '25 08:09

Simon Goodman


2 Answers

If you call std::map::operator[] on a new key, the value of the key will be value initialized (to 0 for PODs like int). So, your loop can be simplified to:

for (auto it = vValues.begin(); it != vValues.end(); ++it)
    ++mValues[*it];

If there is no key *it, then the default value will be 0, but then it is incremented immediately, and it becomes 1.

If the key already exists, then it is simply incremented.

Furthermore, it doesn't look like you need the map to be ordered, so you can use a std::unordered_map instead, as insertion is average constant time, instead of logarithmic, which would speed it up even further.

like image 167
Rakete1111 Avatar answered Sep 17 '25 21:09

Rakete1111


std::vector<std::u16string> vValues;
vValues.push_back( ... )
...

std::sort( vValues.begin(), vValues.end() );
struct counted {
  std::u16string value;
  std::size_t count;
};
std::vector<counted> result;
auto it = vValues.begin();
while (it != vValues.end()) {
  auto r = std::equal_range( it, vValues.end(), *it );
  result.push_back({ *it, r.second-r.first });
  it = r.second;
}

After this is done, result will contain {value, count} for each value and will be sorted.

As all work was done in contiguous containers, it should be faster than your implementation.

If you aren't allowed to mutate vValues, one thing you could do is create a vector of gsl::span<char16_t> from it then sort that, then create the result vector similarly. (If you don't have gsl::span, write one, they aren't hard to write)

Failing that, even copying result once may be faster than your original solution.

Using a gsl::span<char16_t const> in counted would save some allocations as well (reuse the storage within the vValues, at the cost of tying their lifetimes together.

One serious concern is that if your strings are extremely long, determining that two strings are equal is expensive. And if they have common prefixes, determining they are different can be expensive. We do log(n) comparisons per distinct element in the equal_range code, and n log(n) in the sort; sometimes sorting (hash of string, string) pairs can be faster than sorting (string)s alone, as it makes unlike strings easy to detect.

Live example with 4 different versions. Simply change the test1 to test2 or test3 or test4.

test3 is fastest in every test I did:

std::unordered_map<std::string, int> test3(std::vector<std::string> vValues)
{
  std::unordered_map<std::string, int> mValues;
  for( auto it = vValues.begin(); it != vValues.end(); ++it )
  {
    ++mValues[std::move(*it)];
  }
  return mValues;
}

than all the other versions.

like image 37
Yakk - Adam Nevraumont Avatar answered Sep 17 '25 21:09

Yakk - Adam Nevraumont