Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Faster way to read/write a std::unordered_map from/to a file

I am working with some very large std::unordered_maps (hundreds of millions of entries) and need to save and load them to and from a file. The way I am currently doing this is by iterating through the map and reading/writing each key and value pair one at a time:

std::unordered_map<unsigned long long int, char> map;

void save(){
    std::unordered_map<unsigned long long int, char>::iterator iter;
    FILE *f = fopen("map", "wb");
    for(iter=map.begin(); iter!=map.end(); iter++){
        fwrite(&(iter->first), 8, 1, f);
        fwrite(&(iter->second), 1, 1, f);
    }
    fclose(f);
}

void load(){
    FILE *f = fopen("map", "rb");
    unsigned long long int key;
    char val;
    while(fread(&key, 8, 1, f)){
        fread(&val, 1, 1, f);
        map[key] = val;
    }
    fclose(f);
}

But with around 624 million entries, reading the map from a file took 9 minutes. Writing to a file was faster but still took several minutes. Is there a faster way to do this?

like image 850
Ben Avatar asked Jan 23 '18 19:01

Ben


1 Answers

C++ unordered_map implementations must all use chaining. There are a variety of really good reasons why you might want to do this for a general purpose hash table, which are discussed here.

This has enormous implications for performance. Most importantly, it means that the entries of the hash table are likely to be scattered throughout memory in a way which makes accessing each one an order of magnitude (or so) less efficient that would be the case if they could somehow be accessed serially.

Fortunately, you can build hash tables that, when nearly full, give near-sequential access to adjacent elements. This is done using open addressing.

Since your hash table is not general purpose, you could try this.

Below, I've built a simple hash table container with open addressing and linear probing. It assumes a few things:

  1. Your keys are already somehow randomly distributed. This obviates the need for a hash function (though decent hash functions are fairly simple to build, even if great hash functions are difficult).

  2. You only ever add elements to the hash table, you do not delete them. If this were not the case you'd need to change the used vector into something that could hold three states: USED, UNUSED, and TOMBSTONE where TOMBSTONE is the stated of a deleted element and used to continue linear search probe or halt a linear insert probe.

  3. That you know the size of your hash table ahead of time, so you don't need to resize/rehash it.

  4. That you don't need to traverse your elements in any particular order.

Of course, there are probably all kinds of excellent implementations of open addressing hash tables online which solve many of the above issues. However, the simplicity of my table allows me to convey the important point.

The important point is this: my design allows all the hash table's information to be stored in three vectors. That is: the memory is contiguous.

Contiguous memory is fast to allocate, fast to read from, and fast to write to. The effect of this is profound.

Using the same test setup as my previous answer, I get the following times:

Save. Save time = 82.9345 ms
Load. Load time = 115.111 ms

This is a 95% decrease in save time (22x faster) and a 98% decrease in load time (62x faster).

Code:

#include <cassert>
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <functional>
#include <iostream>
#include <random>
#include <vector>

const int TEST_TABLE_SIZE = 10000000;



template<class K, class V>
class SimpleHash {
 public:
  int usedslots = 0;

  std::vector<K> keys;
  std::vector<V> vals;
  std::vector<uint8_t> used;

  //size0 should be a prime and about 30% larger than the maximum number needed
  SimpleHash(int size0){
    vals.resize(size0);
    keys.resize(size0);
    used.resize(size0/8+1,0);
  }

  //If the key values are already uniformly distributed, using a hash gains us
  //nothing
  uint64_t hash(const K key){
    return key;
  }

  bool isUsed(const uint64_t loc){
    const auto used_loc = loc/8;
    const auto used_bit = 1<<(loc%8);
    return used[used_loc]&used_bit;    
  }

  void setUsed(const uint64_t loc){
    const auto used_loc = loc/8;
    const auto used_bit = 1<<(loc%8);
    used[used_loc] |= used_bit;
  }

  void insert(const K key, const V val){
    uint64_t loc = hash(key)%keys.size();

    //Use linear probing. Can create infinite loops if table too full.
    while(isUsed(loc)){ loc = (loc+1)%keys.size(); }

    setUsed(loc);
    keys[loc] = key;
    vals[loc] = val;
  }

  V& get(const K key) {
    uint64_t loc = hash(key)%keys.size();

    while(true){
      if(!isUsed(loc))
        throw std::runtime_error("Item not present!");
      if(keys[loc]==key)
        return vals[loc];

      loc = (loc+1)%keys.size();
    }
  }

  uint64_t usedSize() const {
    return usedslots;
  }

  uint64_t size() const {
    return keys.size();
  }
};

typedef SimpleHash<uint64_t, char> table_t;

void SaveSimpleHash(const table_t &map){
  std::cout<<"Save. ";
  const auto start = std::chrono::steady_clock::now();
  FILE *f = fopen("/z/map", "wb");
  uint64_t size = map.size();
  fwrite(&size, 8, 1, f);
  fwrite(map.keys.data(), 8, size, f);
  fwrite(map.vals.data(), 1, size, f);
  fwrite(map.used.data(), 1, size/8+1, f);
  fclose(f);
  const auto end = std::chrono::steady_clock::now();
  std::cout<<"Save time = "<< std::chrono::duration<double, std::milli> (end-start).count() << " ms" << std::endl;
}

table_t LoadSimpleHash(){
  std::cout<<"Load. ";
  const auto start = std::chrono::steady_clock::now();
  FILE *f = fopen("/z/map", "rb");

  uint64_t size;
  fread(&size, 8, 1, f);

  table_t map(size);
  fread(map.keys.data(), 8, size, f);
  fread(map.vals.data(), 1, size, f);
  fread(map.used.data(), 1, size/8+1, f);
  fclose(f);
  const auto end = std::chrono::steady_clock::now();
  std::cout<<"Load time = "<< std::chrono::duration<double, std::milli> (end-start).count() << " ms" << std::endl;

  return map;
}

int main(){
  //Perfectly horrendous way of seeding a PRNG, but we'll do it here for brevity
  auto generator = std::mt19937(12345); //Combination of my luggage
  //Generate values within the specified closed intervals
  auto key_rand  = std::bind(std::uniform_int_distribution<uint64_t>(0,std::numeric_limits<uint64_t>::max()), generator);
  auto val_rand  = std::bind(std::uniform_int_distribution<int>(std::numeric_limits<char>::lowest(),std::numeric_limits<char>::max()), generator);

  table_t map(1.3*TEST_TABLE_SIZE);
  std::cout<<"Created table of size "<<map.size()<<std::endl;

  std::cout<<"Generating test data..."<<std::endl;
  for(int i=0;i<TEST_TABLE_SIZE;i++)
    map.insert(key_rand(),(char)val_rand()); //Low chance of collisions, so we get quite close to the desired size

  map.insert(23,42);
  assert(map.get(23)==42);

  SaveSimpleHash(map);
  auto newmap = LoadSimpleHash();

  //Ensure that the load worked
  for(int i=0;i<map.keys.size();i++)
    assert(map.keys.at(i)==newmap.keys.at(i));
  for(int i=0;i<map.vals.size();i++)
    assert(map.vals.at(i)==newmap.vals.at(i));  
  for(int i=0;i<map.used.size();i++)
    assert(map.used.at(i)==newmap.used.at(i));    
}
like image 96
Richard Avatar answered Oct 09 '22 13:10

Richard