
comprehensive way to check for functions that use the random number generator in an R script?

is there a smart way to identify all functions that use .Random.seed (the random number generator state within R) at any point in an R script?

use case: we have a dataset that changes constantly, both the records [rows] and the information [columns] - we add new records often, but we also update information in certain columns. so the dataset is constantly in flux. we fill in some missing data with an imputation, which requires random number generation with the sample() function. so whenever we add a new row or update any information in the column, the randomly imputed numbers all change -- which is expected. we use set.seed() at the start of each random imputation, so if a column changes but zero rows change, the other randomly-generated columns are not affected.

i'm under the impression that the only function within our entire codebase that ever touches a random seed is the sample() function, but i would like to verify this somehow.

edit: even something that prints a function call whenever the random number state gets touched would be helpful, the same way debug() comes to life whenever the debugged function gets triggered. for our purposes, it is pretty safe to assume that if we run our script once for dynamic evaluation and no other random functions get triggered, then we are safe.

thanks

Anthony Damico asked Apr 26 '17 15:04




1 Answer

Notwithstanding my comment, here’s a brute force way of checking this:

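# Remove the existing seed first; rm() warns if there is nothing to remove
if (exists('.Random.seed', globalenv(), inherits = FALSE)) {
    rm('.Random.seed', envir = globalenv())
}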
makeActiveBinding('.Random.seed',
                  function () stop('Something touched my seed', call. = FALSE),
                  globalenv())

This will make .Random.seed into an active binding that throws an error when it’s touched.
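To try the tripwire on a single expression without leaving the binding installed, it can be wrapped in a helper. The following is only a sketch: `touches_seed` and the `seed_touched` condition class are hypothetical names that are not part of the original answer, and it assumes `.Random.seed` is not already an active binding when it is called.

```r
# Hypothetical helper (not from the answer): TRUE if evaluating `expr`
# reads or writes .Random.seed, FALSE otherwise.
touches_seed = function (expr) {
    genv = globalenv()
    had_seed = exists('.Random.seed', genv, inherits = FALSE)
    if (had_seed) {
        old_seed = get('.Random.seed', genv, inherits = FALSE)
        rm('.Random.seed', envir = genv)
    }

    # Signal a custom condition class so that unrelated errors raised by
    # `expr` are not mistaken for seed access.
    tripwire = function (value) {
        stop(structure(
            class = c('seed_touched', 'error', 'condition'),
            list(message = 'Something touched my seed', call = NULL)
        ))
    }
    makeActiveBinding('.Random.seed', tripwire, genv)

    # Always remove the binding and restore the previous seed, if any.
    on.exit({
        rm('.Random.seed', envir = genv)
        if (had_seed) assign('.Random.seed', old_seed, genv)
    })

    tryCatch({
        expr  # forces the lazily evaluated argument
        FALSE
    }, seed_touched = function (e) TRUE)
}
```

For example, `touches_seed(sample(5))` should report TRUE and `touches_seed(sum(1:10))` FALSE. Evaluation is aborted at the first access to the seed, so the wrapped expression does not run to completion when it uses the generator.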

The brute-force binding works, but it's very disruptive because it aborts whatever code touched the seed. Here's a gentler variant. It has a few interesting features:

  • It allows enabling and disabling debugging of .Random.seed
  • It supports getting and setting the seed
  • It logs the call but doesn’t stop execution
  • It maintains a “whitelist” of top-level calls that shouldn’t be logged

With this you can write the following code, for instance:

# Ignore calls coming from sample.int
> debug_random_seed(ignore = sample.int)

> sample(5)
Getting .Random.seed
Called from sample(5)
Setting .Random.seed
Called from sample(5)
[1] 3 5 4 1 2

> sample.int(5)
[1] 5 1 2 4 3

> undebug_random_seed()

> sample(5)
[1] 2 1 5 3 4

Here is the implementation in all its glory:

debug_random_seed = local({
    function (ignore = list()) {
        seed_scope = parent.env(environment())

        if (is.function(ignore)) ignore = list(ignore)

        if (exists('.Random.seed', globalenv())) {
            if (bindingIsActive('.Random.seed', globalenv())) {
                warning('.Random.seed is already being debugged')
                return(invisible())
            }
        } else {
            set.seed(NULL)
        }

        # Save existing seed before deleting
        assign('random_seed', .Random.seed, seed_scope)
        rm(.Random.seed, envir = globalenv())

        debug_seed = function (new_value) {
            if (sys.nframe() > 1 &&
                ! any(vapply(ignore, identical, logical(1), sys.function(1)))
            ) {
                if (missing(new_value)) {
                    message('Getting .Random.seed')
                } else {
                    message('Setting .Random.seed')
                }
                message('Called from ', deparse(sys.call(1)))
            }

            if (! missing(new_value)) {
                assign('random_seed', new_value, seed_scope)
            }

            random_seed
        }

        makeActiveBinding('.Random.seed', debug_seed, globalenv())
    }
})

undebug_random_seed = function () {
    if (! (exists('.Random.seed', globalenv()) &&
           bindingIsActive('.Random.seed', globalenv()))) {
        warning('.Random.seed is not being debugged')
        return(invisible())
    }

    seed = suppressMessages(.Random.seed)
    rm('.Random.seed', envir = globalenv())
    assign('.Random.seed', seed, globalenv())
}

Some notes about the code:

  • The debug_random_seed function is defined inside its own private environment. This environment is designated by seed_scope in the code. This prevents leaking the private random_seed variable into the global environment.
  • The function defensively checks whether debugging is already enabled. Overkill maybe.
  • Debug information is only printed when the seed is accessed within a function call. If the user inspects .Random.seed directly on the R console, no logging occurs.
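
If only a yes/no answer for a whole script run is needed, a different and simpler technique than the active binding above is to compare the generator state before and after. This is a minimal sketch; `rng_state_changed` is a hypothetical name, and the check cannot see code that saves and later restores the seed, nor does it say which function was responsible.

```r
# Hypothetical helper (not from the answer): TRUE if evaluating `expr`
# leaves .Random.seed in a different state than it found it.
rng_state_changed = function (expr) {
    genv = globalenv()

    # Make sure a seed exists so there is something to compare against.
    if (! exists('.Random.seed', genv, inherits = FALSE)) set.seed(NULL)

    before = get('.Random.seed', genv, inherits = FALSE)
    force(expr)  # evaluate the lazily passed expression
    after = get('.Random.seed', genv, inherits = FALSE)

    ! identical(before, after)
}
```

For instance, `rng_state_changed(sample(5))` should be TRUE while `rng_state_changed(sum(1:10))` should be FALSE. Wrapping a `source()` call for the script in question gives a quick first check before reaching for the logging version.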
Konrad Rudolph answered Oct 24 '22 21:10