Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Partitioning/batch/chunk a container into equal sized pieces using std algorithms

Tags:

c++

c++11

stl

boost

I came across a situation where I had to batch process a set of records off to a database. I am wondering how I could accomplish this with std algorithms.

Given 10002 records I want partition it into bins of 100 records for processing, with the remainder being a bin of 2.

Hopefully the following code will better illustrate what I'm trying to accomplish. I'm completely open to solutions involving iterators, lambdas any sort of modern C++ fun.

#include <cassert>
#include <vector>
#include <algorithm>

template< typename T >
std::vector< std::vector< T > > chunk( std::vector<T> const& container, size_t chunk_size )
{
  return std::vector< std::vector< T > >();
}

int main()
{
  int i = 0;
  const size_t test_size = 11;
  std::vector<int> container(test_size);
  std::generate_n( std::begin(container), test_size, [&i](){ return ++i; } );

  auto chunks = chunk( container, 3 );

  assert( chunks.size() == 4 && "should be four chunks" );
  assert( chunks[0].size() == 3 && "first several chunks should have the ideal chunk size" );
  assert( chunks.back().size() == 2 && "last chunk should have the remaining 2 elements" );

  return 0;
}
like image 712
Zac Avatar asked Jan 09 '13 01:01

Zac


2 Answers

In the context of parallelization, where this sort of partitioning of ranges is common, I found it to be useful to make the notion of a range, i.e. a pair (from,to) the smallest unit passed between the various routines and threads of the process.

It's better than making a copy of an entire sub-vector for each portion because it requires a lot less memory space. It is also more practical than maintaining single end-iterators only, because each range can be passed as-is to a thread – there are no special cases when it's the very first or very last portion etc.

With this in mind, the following is a simplified version of a routine I found to work well, and it's all modern C++11:

#include <cassert>
#include <vector>
#include <utility>
#include <algorithm>
#include <cstdint>

template <typename It>
std::vector<std::pair<It,It>>
  chunk(It range_from, It range_to, const std::ptrdiff_t num)
{
  /* Aliases, to make the rest of the code more readable. */
  using std::vector;
  using std::pair;
  using std::make_pair;
  using std::distance;
  using diff_t = std::ptrdiff_t;

  /* Total item number and portion size. */
  const diff_t total
  { distance(range_from,range_to) };
  const diff_t portion
  { total / num };

  vector<pair<It,It>> chunks(num);

  It portion_end
  { range_from };

  /* Use the 'generate' algorithm to create portions. */    
  std::generate(begin(chunks),end(chunks),[&portion_end,portion]()
        {
          It portion_start
          { portion_end };

          portion_end += portion;
          return make_pair(portion_start,portion_end);
        });

  /* The last portion's end must always be 'range_to'. */    
  chunks.back().second = range_to;

  return chunks;
}

int main()
{
  using std::distance;

  int i = 0;
  const size_t test_size = 11;
  std::vector<int> container(test_size);
  std::generate_n( std::begin(container), test_size, [&i](){ return ++i; } );

  /* This is how it's used: */    
  auto chunks = chunk(begin(container),end(container),3);

  assert( chunks.size() == 3 && "should be three chunks" );
  assert( distance(chunks[0].first,chunks[0].second) == 3 && "first several chunks should have the ideal chunk size" );
  assert( distance(chunks[2].first,chunks[2].second) == 5 && "last chunk should have 5 elements" );

  return 0;
}

It works slightly different from what you proposed: The portion size is always rounded down, so you get only 3 portions in your example, and the final portion is slightly larger (not slightly smaller) than the others. This can be easily modified (I don't think it matters very much because typically the number of portions is much smaller than the total number of work items).


Remark. In my own use of range-related patterns, it quickly turned out that actually storing integers (each indicating the distance from .begin()) rather than iterators is often better. The reason is that converting between these integers and actual iterators is a quick and harmless operation, and can be performed regardless of whether you need iterator or const_iterator. Whereas, when you store iterators, you need to decide once and for all whether you work with iterator or const_iterator, which can be a pain.

like image 114
jogojapan Avatar answered Nov 15 '22 15:11

jogojapan


The problem seems to be a variation on std::for_each, where the "each" you want to operate on is an interval of your collection. Thus you would prefer to write a lambda (or function) that takes two iterators defining the start and end of each interval and pass that lambda/function to your algorithm.

Here's what I came up with...

// (Headers omitted)

template < typename Iterator >
void for_each_interval(
    Iterator begin
  , Iterator end
  , size_t interval_size
  , std::function<void( Iterator, Iterator )> operation )
{
  auto to = begin;

  while ( to != end )
  {
    auto from = to;

    auto counter = interval_size;
    while ( counter > 0 && to != end )
    {
      ++to;
      --counter;
    }

    operation( from, to );
  }
}

(I wish that std::advance would take care of the inner loop that uses counter to increment to, but unfortunately it blindly steps beyond the end [I'm tempted to write my own smart_advance template to encapsulate this]. If that would work, it would reduce the amount of code by about half!)

Now for some code to test it...

// (Headers omitted)

int main( int argc, char* argv[] )
{
  // Some test data
  int foo[10] = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 };
  std::vector<int> my_data( foo, foo + 10 );
  size_t const interval = 3;

  typedef decltype( my_data.begin() ) iter_t;
  for_each_interval<iter_t>( my_data.begin(), my_data.end(), interval,
    []( iter_t from, iter_t to )
    {
      std::cout << "Interval:";
      std::for_each( from, to,
        [&]( int val )
        {
          std::cout << " " << val;
        } );
      std::cout << std::endl;
    } );
}

This produces the following output, which I think represents what you want:

Interval: 0 1 2
Interval: 3 4 5
Interval: 6 7 8
Interval: 9
like image 40
aldo Avatar answered Nov 15 '22 15:11

aldo