Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Which container is most efficient for multiple insertions / deletions in C++?

Tags:

c++

containers

I was set a homework challenge as part of an application process (I was rejected, by the way; I wouldn't be writing this otherwise) in which I was to implement the following functions:

// Store a collection of integers
class IntegerCollection {
public:
  // Insert one entry with value x
  void Insert(int x);

  // Erase one entry with value x, if one exists
  void Erase(int x);

  // Erase all entries, x, from <= x < to
  void Erase(int from, int to);

  // Return the count of all entries, x, from <= x < to
  size_t Count(int from, int to) const;

The functions were then put through a bunch of tests, most of which were trivial. The final test was the real challenge as it performed 500,000 single insertions, 500,000 calls to count and 500,000 single deletions.

The member variables of IntegerCollection were not specified and so I had to choose how to store the integers. Naturally, an STL container seemed like a good idea and keeping it sorted seemed an easy way to keep things efficient.

Here is my code for the four functions using a vector:

// Previous bit of code shown goes here 

private:
  std::vector<int> integerCollection;
};

void IntegerCollection::Insert(int x) {

  /* using lower_bound to find the right place for x to be inserted
  keeps the vector sorted and makes life much easier */
  auto it = std::lower_bound(integerCollection.begin(), integerCollection.end(), x);
  integerCollection.insert(it, x);
}

void IntegerCollection::Erase(int x) {

  // find the location of the first element containing x and delete if it exists
  auto it = std::find(integerCollection.begin(), integerCollection.end(), x);

  if (it != integerCollection.end()) {
    integerCollection.erase(it);
  }

}

void IntegerCollection::Erase(int from, int to) {

  if (integerCollection.empty()) return;

  // lower_bound points to the first element of integerCollection >= from/to
  auto fromBound = std::lower_bound(integerCollection.begin(), integerCollection.end(), from);
  auto toBound = std::lower_bound(integerCollection.begin(), integerCollection.end(), to);

  /* std::vector::erase deletes entries between the two pointers
  fromBound (included) and toBound (not indcluded) */
  integerCollection.erase(fromBound, toBound);

}

size_t IntegerCollection::Count(int from, int to) const {

  if (integerCollection.empty()) return 0;

  int count = 0;

  // lower_bound points to the first element of integerCollection >= from/to
  auto fromBound = std::lower_bound(integerCollection.begin(), integerCollection.end(), from);
  auto toBound = std::lower_bound(integerCollection.begin(), integerCollection.end(), to);

  // increment pointer until fromBound == toBound (we don't count elements of value = to)
  while (fromBound != toBound) {
    ++count; ++fromBound;
  }

  return count;

}

The company got back to me saying that they wouldn't be moving forward because my choice of container meant the runtime complexity was too high. I also tried using list and deque and compared the runtime. As I expected, I found that list was dreadful and that vector took the edge over deque. So as far as I was concerned I had made the best of a bad situation, but apparently not!

I would like to know what the correct container to use in this situation is? deque only makes sense if I can guarantee insertion or deletion to the ends of the container and list hogs memory. Is there something else that I'm completely overlooking?

like image 711
ralphweinstein Avatar asked Jun 04 '19 19:06

ralphweinstein


3 Answers

We cannot know what would make the company happy. If they reject std::vector without concise reasoning I wouldn't want to work for them anyway. Moreover, we dont really know the precise requirements. Were you asked to provide one reasonably well performing implementation? Did they expect you to squeeze out the last percent of the provided benchmark by profiling a bunch of different implementations?

The latter is probably too much for a homework challenge as part of an application process. If it is the first you can either

  • roll your own. It is unlikely that the interface you were given can be implemented more efficiently than one of the std containers does... unless your requirements are so specific that you can write something that performs well under that specific benchmark.
  • std::vector for data locality. See eg here for Bjarne himself advocating std::vector rather than linked lists.
  • std::set for ease of implementation. It seems like you want the container sorted and the interface you have to implement fits that of std::set quite well.

Let's compare only isertion and erasure assuming the container needs to stay sorted:

   operation         std::set          std::vector
   insert            log(N)            N
   erase             log(N)            N

Note that the log(N) for the binary_search to find the position to insert/erase in the vector can be neglected compared to the N.

Now you have to consider that the asymptotic complexity listed above completely neglects the non-linearity of memory access. In reality data can be far away in memory (std::set) leading to many cache misses or it can be local as with std::vector. The log(N) only wins for huge N. To get an idea of the difference 500000/log(500000) is roughly 26410 while 1000/log(1000) is only ~100.

I would expect std::vector to outperform std::set for considerably small container sizes, but at some point the log(N) wins over cache. The exact location of this turning point depends on many factors and can only reliably determined by profiling and measuring.

like image 105
463035818_is_not_a_number Avatar answered Oct 16 '22 05:10

463035818_is_not_a_number


Nobody knows which container is MOST efficient for multiple insertions / deletions. That is like asking what is the most fuel-efficient design for a car engine possible. People are always innovating on the car engines. They make more efficient ones all the time. However, I would recommend a splay tree. The time required for a insertion or deletion is a splay tree is not constant. Some insertions take a long time and some take only a very a short time. However, the average time per insertion/deletion is always guaranteed to be be O(log n), where n is the number of items being stored in the splay tree. logarithmic time is extremely efficient. It should be good enough for your purposes.

like image 32
Toothpick Anemone Avatar answered Oct 16 '22 06:10

Toothpick Anemone


The first thing that comes to mind is to hash the integer value so single look ups can be done in constant time.

The integer value can be hashed to compute an index in to an array of bools or bits, used to tell if the integer value is in the container or not.

Counting and and deleting large ranges could be sped up from there, by using multiple hash tables for specific integer ranges.

If you had 0x10000 hash tables, that each stored ints from 0 to 0xFFFF and were using 32 bit integers you could then mask and shift the upper half of the int value and use that as an index to find the correct hash table to insert / delete values from.

IntHashTable containers[0x10000];
u_int32 hashIndex = (u_int32)value / 0x10000;
u_int32int valueInTable = (u_int32)value - (hashIndex * 0x10000);
containers[hashIndex].insert(valueInTable);

Count for example could be implemented as so, if each hash table kept count of the number of elements it contained:

indexStart = startRange / 0x10000;
indexEnd = endRange / 0x10000;

int countTotal = 0;
for (int i = indexStart; i<=indexEnd; ++i) {
   countTotal += containers[i].count();
}
like image 23
icodebot Avatar answered Oct 16 '22 04:10

icodebot