Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Random enumeration of a hash table in OCaml

Sorry for the long question. I decided to explain the context of the problem first as maybe there are other solutions to my problem. If you're in a hurry, just read THE QUESTION below.

(EDITED -- In the mean time I added some attempts to solve the problem. The fourth one has my final conclusion, you can jump straight to it.)

THE CONTEXT

I have a hash table filled with roughly 20k pairs (key(i),value(i)). I want to generate random lists like this one

[(key(213),value(213));(key(127),value(127));(key(89),value(89));...]

The restriction is that once I've chosen key(213) to be the first element of the list, not all keys can follow it (I have some other function 'decide' which can decide whether some key can be the next one in the list or not). So, I'd like to choose a random next element and check if it's appropriate -- in the example above key(127) was chosen. In the case that that element is rejected by my 'decide' function I'd like to pick another one randomly. But I wouldn't want to pick the same just rejected because I know it will be rejected again and not only would this be inefficient, I also run the risk, if only a few keys can be the next ones, of taking a long time until I find an appropriate key. Note that there can be repetition, such as

[(key(213),value(213));(key(213),value(213));(key(78),value(78));...]

This is OK, as long as the 'decide' function accepts key(213) as the next in the list. So, all I need is a way to randomly enumerate the (key,value) pairs in the hash table. Whenever I have to choose a key, I create an enumeration, which I consume by checking each new element with the 'decide' function (so, no repetitions occur) and when I find one I add it to the list and proceed incrementing the list. The thing is that I don't want this enumeration of the hash table to be the same every time. I want it to be random. (This has to do with the structure of the search space I have in my particular problem which is not relevant here.)

I can of course implement this by generating random integers and using just lists -- that's what I'm currently doing. But, as this is something I run into quite frequently, I wonder if there is some random enumeration facility for hash tables somewhere.

THE QUESTION

Is there some random enumeration function for hash tables somewhere? I am aware of the function BatHashtbl.enum (Batteries library) but I think it will always give me the same enumeration for the same hash table (is this correct?). Also, there doesn't seem to exist anything of the sort in that BatHashtbl module. I'd be interested in something like

random_enum: ('a, 'b) t -> int -> ('a * 'b) Enum.t

which, when provided with a hash table and some integer as a seed for the random generator, will give a different random enumeration of the hash table. Any ideas?

Thanks for any help!

Best, Surikator.

FIRST ATTEMPT

After Niki's suggestion in the comments, and looking in more detail through the Batteries Library, I've come up with this

let rand_enum ht n =
BatRandom.init n;
let hte = BatHashtbl.enum ht
in let s = BatRandom.shuffle hte (* This returns*)
in Array.to_list s

which has the type

val rand_enum : ('a,'b) BatHashtbl.t -> int -> ('a*'b) list

It uses the Fisher-Yates algorithm for the shuffling which runs in O(n). It returns a list instead of an enumeration and that is quite annoying, because it means that even if I'm happy with the third element of the list obtained with rand_enum, the function will still compute a random enumeration for the whole 20k elements in the hash table.

Best, Surikator

SECOND ATTEMPT

I defined the module RndHashtblEnum as

(* Random Hashtable Enumeration Module *)
type ('a,'b) t = {
   ht:('a,'b) BatHashtbl.t;
   mutable ls:('a*'b) list;
   f: (('a,'b) BatHashtbl.t -> ('a*'b) list)}

let shuffle ht =
  let hte = BatHashtbl.enum ht
  in let s = BatRandom.shuffle hte
  in Array.to_list s

let create ht n = (BatRandom.init n; {ht=ht;ls=shuffle ht;f=shuffle})

let rec next re =
match re.ls with
    | [] -> re.ls<-(re.f re.ht);next re
    | h::t -> re.ls<-t; h

It has the new type t for random enumerations of hash tables. This type stores the hash table we wish to enumerate, the list we will be enumerating from and a function to compute a new enumerated list (from the hash table) once the list we have runs out. Once the list has run out, when we ask for a new random element of the hash table, the type t automatically puts a new random list created from the hash table.

So, using the above module, if we want to enumerate a hash table randomly, we simply do:

let re = RndHashtblEnum.create ht 1236

to create a random enumeration of hash table ht with random seed 1236 (In this code I assume the hash table was defined before) and we can then write

let (k,v) = RndHashtblEnum.next re

to get the next (k,v) pair from the random enumeration.

One question we may ask is whether this is actually fair randomness because I use the remaining of the list to randomly enumerate the hash table the next time I need a random enumeration. Well, it isn't. If my hash table has say 1000 elements and after extracting 5 random ones I am satisfied with the result, I know that in the next 995 (of the second set of extractions) none of those 5 elements will be extracted. So, that's not fair randomness. It's even worse. It may well be the case that in the next 1000 extractions (995 from this list, 5 from the next enumerating list) some elements will not be covered. On average, the algorithm is fair, but it isn't fair all the time.

Best, Surikator.

THIRD ATTEMPT

Hi again,

Including Niki's suggestion of using BatArray.enum and a fundamental change in the stochastic part of the algorithm, I have come up with a new improved version of the RndHashtblEnum module. The suggestion is:

(* Improved Random Hashtable Enumeration Module *)
type ('a,'b) t = {ht:('a,'b) BatHashtbl.t; mutable enum:('a*'b) BatEnum.t; enum0: ('a*'b) BatEnum.t}

let shuffle ht =
let hte = BatHashtbl.enum ht
in let s = BatRandom.shuffle hte
in BatArray.enum s

let create ht n =
let e = shuffle ht
in (BatRandom.init n; {ht=ht;enum=BatEnum.clone e;enum0=e})

let rec next re =
match BatEnum.get re.enum with
    | None -> re.enum<-re.enum0; next re
    | Some e -> e

This new module gets rid of the (silly) costs of passing an array to a list and only uses the Fisher-Yates algorithm once at the start -- so, in the long run, we can consider the contribution of the Fisher-Yates bit to be O(1).

The new version is now fair in terms of randomness. This is not so easy to see and it took me a bit to realize that. Suppose the hash table has 1000 entries. In the new version, we always use the same enumeration (enum0 -- fixed when we create the random enumeration with the "create" function). This means that, when trying to find the next element in our final list, since some key in the hash table will have to satisfy the "decide" function (otherwise we would couldn't continue with the algorithm and we would just stop), it will do so somewhere between the 0th and the 999th entry. Suppose it's on entry 300. Now, given we've chosen this key, for deciding the next key in the final list, our enumeration will continue with the remaining 700 elements and then will go on to the next 300 in the copy of the same enumeration. So, the 700+300 make exactly the 1000 in the hash table. This means that we will always consider each entry in the hash table once and only once. The other thing is that each time we attempt to find a key to go in the list that label could be found on entry 300 but also on entry 734 or something else, because the decide function actually depends on which previous keys have been chosen up to that point in the final list. So, each time we start fresh looking for an element for the final list in the hash table we start with a random element of the hash table.

Sorry if this is not very clear. It's hard to explain. =)

Thanks for all the comments.

Best, Surikator.

FOURTH AND FINAL ATTEMPT -- THIS IS MY PROPOSED SOLUTION

Hi yet again,

Sharing gasche's worries about mutable fields and enumerations in general and all the weird side-effects that can come from there, I decided to forget about off-the-shelf solutions using available hash tables libraries and wrote my stuff using plain lists. I also brought laziness to deal with avoiding generating random lists of which only part would be used (so there was useful lazy stuff to be used as you suggested, Niki).

I created the type

type 'a node_t =
   | ENil
   | ECons of 'a * 'a list * 'a t
and 'a t = ('a node_t) Lazy.t

for lazy random enumerations of lists. Each enumeration is either empty (ENil) or not (ECons) in which case it has three parts to it: (1) the element currently in focus, (2) the rest of available elements to enumerate, (3) another enumeration to continue this enumeration.

Then, a random enumeration of a list can be obtained using the create function

let rec create ls =
lazy(   match ls with
    | [] -> ENil
    | h::t -> let n = Random.int (List.length ls)
              in let newx,rest=remove ls n
          in ECons(newx,rest,create t))

where the auxiliary remove function has been defined to extract the nth element of a list and return a pair (x,ls) where x is the extracted element and ls is the new list without the extracted element. Just for completeness I add the code of the remove function here too.

let rec remove ls n =
let rec remove_ ls acc k n =
    match ls with
        | []        -> raise (Failure "remove")
        | h::t  -> if k=n
            then    h, List.rev_append acc t
            else remove_ t (h::acc) (k+1) n
in remove_ ls [] 0 n

We can now define very simple functions for generating the next state of the random enumeration and for getting the actual element in each state of the enumeration. Those are

exception End_of_enum
let next e =
match Lazy.force e with
    | ENil -> raise End_of_enum
    | ECons(x,ls,t) -> t
let rec get e =
match Lazy.force e with
    | ENil -> raise End_of_enum
    | ECons(x,ls,t) -> x

OK, up until now, I've simply enumerated lists randomly. If we want to enumerate a hash table instead, we can use

let rand_enum ht =
let ls = Hashtbl.fold (fun k v acc -> (k, v) :: acc) ht []
in create ls

to obtain a random enumeration of the pairs in the hash table and we can use next and get to obtain the (key,value) pairs. The fold but is just a way of getting all the (key,value) pairs of the hash table in a list (thanks Pascal for your answer in this question).

This ends the whole hash table enumeration thing. For the sake of completeness, I'm adding also the solution to the overall problem I was trying to solve, explained in "The Context" above. The problem, if you recall, is to randomly generate a list of (key,value) pairs from (1) a hash table, and (2) a decide function which can tell whether a (key,value) can be appended to some particular list of pairs. Since the whole generation process may never terminate, to ensure termination I thought it would make sense to have a third argument which is a function which tells whether we should stop the process or not (and which we should make sure will return true at some point for the whole process to terminate).

The function generate could be something like

let generate ht d stop =
let rec gen1 d fst e =
    if d (List.rev fst) (get e)
                then (get e)::fst
                else gen1 d fst (next e)
in let rec generate_ ht d stop acc =
            let e = rand_enum ht
            in  if stop acc
                        then acc
                        else    try generate_ ht d stop (gen1 d acc e)
                          with End_of_enum -> generate_ ht d stop (List.tl acc)
in generate_ ht d stop []

Thanks a lot to everyone who contributed with useful comments. This was really helpful.

All the best, Surikator.

like image 882
Surikator Avatar asked Oct 29 '10 12:10

Surikator


1 Answers

I have two suggestions. The first is to change your rand_enum function so it returns an Enum.t:

let rand_enum ht n =
BatRandom.init n;
let hte = BatHashtbl.enum ht
in Array.enum (BatRandom.shuffle hte)

which isn't terribly different (it's still computing a random enum for all 20k) but is closer to what you originally wanted.

Alternatively, you could always take the source code of HashTbl and recompile it with a rand_enum function. However this also probably won't be that different, as a HashTbl is implemented as an array and if you want to avoid bad duplicates you're probably going to end up using a shuffle.

like image 64
Niki Yoshiuchi Avatar answered Jan 11 '23 08:01

Niki Yoshiuchi