
Mersenne twister warm up vs. reproducibility

In my current C++11 project I need to perform M simulations. For each simulation m = 1, ..., M, I randomly generate a data set by using a std::mt19937 object, constructed as follows:

std::mt19937 generator(m);
DatasetFactory dsf(generator);

According to https://stackoverflow.com/a/15509942/1849221 and https://stackoverflow.com/a/14924350/1849221, the Mersenne Twister PRNG benefits from a warm-up phase, which is currently absent from my code. For convenience, here is the proposed snippet:

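#include <algorithm>   // std::generate_n
#include <cstdint>     // std::uint_least32_t
#include <functional>  // std::ref
#include <iterator>    // std::begin, std::end
#include <random>

std::mt19937 get_prng() {
    std::uint_least32_t seed_data[std::mt19937::state_size];
    std::random_device r;
    std::generate_n(seed_data, std::mt19937::state_size, std::ref(r));
    std::seed_seq q(std::begin(seed_data), std::end(seed_data));
    return std::mt19937{q};
}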

The problem in my case is that I need reproducible results: across different executions, each simulation must produce the same data set. That is why my current solution seeds the Mersenne Twister PRNG with the index m of the current simulation. It seems to me that using std::random_device would prevent the data from being the same (AFAIK, this is the exact purpose of std::random_device).

EDIT: by different executions I mean re-launching the executable.

How can I introduce the aforementioned warm-up phase in my code without affecting reproducibility? Thanks.

Possible solution #1

Here's a tentative implementation based on @SteveJessop's proposal to generate the seed sequence with a different PRNG (an LCG) seeded with m:

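#include <algorithm>   // std::generate_n
#include <cstdint>     // std::uint_least32_t
#include <functional>  // std::ref
#include <iterator>    // std::begin, std::end
#include <random>

std::mt19937 get_generator(unsigned int seed) {
    // The LCG is seeded deterministically, so the seed sequence is reproducible.
    std::minstd_rand0 lc_generator(seed);
    std::uint_least32_t seed_data[std::mt19937::state_size];

    // Draw one value per word of MT state (state_size = 624) from the LCG.
    std::generate_n(seed_data, std::mt19937::state_size, std::ref(lc_generator));
    std::seed_seq q(std::begin(seed_data), std::end(seed_data));
    return std::mt19937{q};
}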
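
A quick way to convince oneself that this construction is reproducible is to build two generators from the same seed and compare them. The sketch below repeats get_generator so that it compiles on its own; same_seed_same_generator is a hypothetical helper name introduced only for this check.

```cpp
#include <algorithm>   // std::generate_n
#include <cstdint>     // std::uint_least32_t
#include <functional>  // std::ref
#include <iterator>    // std::begin, std::end
#include <random>

// Same construction as above, repeated so this sketch is self-contained.
std::mt19937 get_generator(unsigned int seed) {
    std::minstd_rand0 lc_generator(seed);
    std::uint_least32_t seed_data[std::mt19937::state_size];
    std::generate_n(seed_data, std::mt19937::state_size, std::ref(lc_generator));
    std::seed_seq q(std::begin(seed_data), std::end(seed_data));
    return std::mt19937{q};
}

// Hypothetical check: two generators built from the same seed must compare
// equal, i.e. they hold the same state and will emit the same stream.
bool same_seed_same_generator(unsigned int seed) {
    return get_generator(seed) == get_generator(seed);
}
```

One caveat: std::minstd_rand0 replaces a seed of 0 with the state 1, so get_generator(0) and get_generator(1) yield identical generators. With m = 1, ..., M this case does not arise.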

Possible solution #2

Here's a tentative implementation based on the joint contribution by @SteveJessop and @AndréNeve. The sha256 function is adapted from https://stackoverflow.com/a/10632725/1849221

#include <openssl/sha.h>
#include <iomanip>
#include <random>
#include <sstream>
#include <string>

// Returns the SHA-256 digest of str as a 64-character hex string.
std::string sha256(const std::string& str) {
    unsigned char hash[SHA256_DIGEST_LENGTH];
    SHA256_CTX sha256;
    SHA256_Init(&sha256);
    SHA256_Update(&sha256, str.c_str(), str.size());
    SHA256_Final(hash, &sha256);

    std::stringstream ss;
    for (int i = 0; i < SHA256_DIGEST_LENGTH; i++)
        ss << std::hex << std::setw(2) << std::setfill('0') << static_cast<int>(hash[i]);

    return ss.str();
}

std::mt19937 get_generator(unsigned int seed) {
    std::string seed_str = sha256(std::to_string(seed));
    // Each hex character is passed to std::seed_seq as a separate value.
    std::seed_seq q(seed_str.begin(), seed_str.end());
    return std::mt19937{q};
}

Compile with: -I/opt/ssl/include/ -L/opt/ssl/lib/ -lcrypto

Ilio Catallo asked Apr 18 '13 09:04



1 Answer

Two options:

  1. Follow the proposal you have, but instead of using std::random_device r; to generate your seed sequence for MT, use a different PRNG seeded with m. Choose one that doesn't suffer like MT does from needing a warmup when used with small seed data: I suspect an LCG will probably do. For massive overkill, you could even use a PRNG based on a secure hash. This is a lot like "key stretching" in cryptography, if you've heard of that. You could in fact use a standard key stretching algorithm, but you're using it to generate a long seed sequence rather than large key material.

  2. Continue using m to seed your MT, but discard a large constant amount of data before starting the simulation. That is to say, ignore the advice to use a strong seed and instead run the MT long enough for it to reach a decent internal state. I don't know off-hand how much data you need to discard, but I expect the internet does.
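
The second option can be sketched with std::mt19937::discard, which advances the engine by a given number of outputs without returning them. The helper name get_warmed_generator and the default warm-up count are assumptions made for illustration; in particular, the count is an arbitrary placeholder, not a recommended value.

```cpp
#include <random>

// Hypothetical helper: seed the MT with the simulation index m, then throw
// away a fixed number of outputs before the generator is first used.
// The default warmup_draws is a placeholder assumption, not a tuned value.
std::mt19937 get_warmed_generator(unsigned int m,
                                  unsigned long long warmup_draws = 100000ULL) {
    std::mt19937 generator(m);        // deterministic: same m, same initial state
    generator.discard(warmup_draws);  // advance the state, ignoring the outputs
    return generator;
}
```

Since both the seed and the number of discarded outputs are fixed, re-launching the executable gives the same generator for each m. In the question's code this would replace the direct construction, e.g. std::mt19937 generator = get_warmed_generator(m); followed by DatasetFactory dsf(generator);.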

Steve Jessop answered Sep 22 '22 06:09
