Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Choosing between set<int> vs. vector<bool> vs. vector<boolean_t> to use as a bitmap (bitset / bit array)

Given a range of indexes (identifiers), where I want to map each index to a boolean value, that is:

// interface pseudocode
interface bitmap {
  bool identifier_is_set(unsigned int id_idx) const;
  void set_identifier(unsigned int id_idx, bool val) const;
};

so that I can set and query for each ID (index) if it is set or not, what would you prefer to use to implement this?

I think this is called a bit array or bitmap or bitset, correct me if I'm wrong.

Assume that the maximum identifier is predetermined and not greater than 1e6 (1m), possibly much smaller (10k - 100k). (Which means the size used by sizeof(int)*maximum_id_idx easily fits into memory.)

Possible solutions I see so far:

  • std::set<size_t> - Add or erase the identifier to this set as neccessary. This would allow for arbitrarily large identifiers as long as we have a sparse bitmap.
  • std::vector<bool> - Sized to the appropriate maximum value, storing true or false for each id_idx.
  • std::vector<char> - Same thing, but not suffering from weird std::vector<bool> problems. Uses less memory than vector<int>.
  • std::vector<int> - Using an int as the boolean flag to have a container using the natural word size of the machine. (No clue if that could make a difference.)

Please answer which container type you would prefer and why, given the maximum id restriction cited above and especially considering performance aspects of querying the bitmap (inserting performance does not matter).

Note: The interface usage of vector vs. set does not matter, as it will be hidden behind it's wrapping class anyway.

EDIT: To add to the discussion about std::bitset : std::bitset will incorporate the whole array size into the object, that is a sizeof(std::bitset<1m>) will be a size of approx 1/8 megabyte, which makes for a huge single object and makes for something you cannot put on the stack anymore (which may or may not be relevant).

like image 235
Martin Ba Avatar asked Feb 25 '23 20:02

Martin Ba


2 Answers

Without knowing the platform you are running this code on and your access patterns, it's hard to say whether vector<bool> will be faster than vector<char> (or vector<int>) or even set<int> or unordered_set<int>.

For example, if you have an extremely sparse array, a linear search of a vector<int> that just contains the indices set might be the best answer. (See Mike Abrash's article on optimizing Pixomatic for x86.)

On the other hand, you might have a somewhat sparse array. By somewhat sparse, I mean that the number of set elements is much greater than L1 or L2. In that case, more low-level details start to come into play, as well as your actual access patterns.

For example, on some platforms, variable bit shifting is incredibly expensive. So, if you are querying a random set of identifiers, the more frequently you do this, the more a vector<char> or vector<int> becomes a better idea than bitset<...> or vector<bool>. (The latter two use bit shifts to lookup bits.) On the other hand, if you are iterating through the sparse bit vector in order and just want the bits set, you can optimize that iteration to get rid of the overhead of variable shifts.

At this point, you might also want to know how your sparse identifiers are actually distributed. If they are clumped, you need to know the tradeoff between the optimal memory read size and reading a char at a time. That will dictate whether hitting the cache more often will offset reading in non-native sized data.

If the identifiers are scattered, you may get a significant win by using a hash set (unordered_set<int>) instead of a bit vector. That depends on the load, however.

like image 112
MSN Avatar answered Apr 29 '23 14:04

MSN


Have you checked out boost::dynamic_bitset?

http://www.boost.org/doc/libs/1_36_0/libs/dynamic_bitset/dynamic_bitset.html

like image 32
Moo-Juice Avatar answered Apr 29 '23 13:04

Moo-Juice