Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Anything wrong with my Fisher-Yates shuffle?

Aware that when something seems too good to be true it usually is, I figured I would pose this question to hopefully flush out any gremlins. I reviewed the few related threads that I could find, but still my question lingers.

I am relatively new to Haskell, and in my experimentation I coded up a basic Fisher-Yates shuffle as shown below.

shuffle :: RandomGen g => [a] -> g -> ([a],g)
shuffle [] g0 = ([],g0)
shuffle [x] g0 = ([x],g0)
shuffle xs g0 = (x:newtail,g2)
  where (i,g1) = randomR (0, length $ tail xs) g0
        (xs1,x:xs2) = splitAt i xs
        (newtail,g2) = shuffle (xs1++xs2) g1

This implementation of course uses beaucoup memory for large lists, but it's fast - on my laptop avg 5s for 30M ints vs. Std C++ shuffle at 2.3s). In fact, it is much faster than other Haskell implementations have found elsewhere.(e.g., http://www.haskell.org/haskellwiki/Random_shuffle)

Given other Haskell shuffles I've seen are both more complicated and slower, I am wondering whether the speedup/simplicity is simply my reward for being a unapologetic memory hog, or if I have missed some tiny but crucial detail that makes my algorithm biased. I have not tested extensively, but a preliminary look seems to show a uniform distribution of permutations.

I would appreciate the assessment of more eyes with more Haskell and/or shuffling experience. Many thanks in advance to all who take the time to reply.

like image 665
Tientuinë Avatar asked Apr 26 '13 18:04

Tientuinë


People also ask

How does the Fisher Yates shuffle work?

The Fisher–Yates shuffle, as implemented by Durstenfeld, is an in-place shuffle. That is, given a preinitialized array, it shuffles the elements of the array in place, rather than producing a shuffled copy of the array. This can be an advantage if the array to be shuffled is large.

How do I shuffle an optimally array?

Shuffle Array using Random Class We can iterate through the array elements in a for loop. Then, we use the Random class to generate a random index number. Then swap the current index element with the randomly generated index element. At the end of the for loop, we will have a randomly shuffled array.

How does shuffle algorithm work?

First, the algorithm spreads all the songs from the same artist all over the playlist as evenly as possible. Then, all the songs of all artists are collected an ordered by position.

How do you shuffle items in an array Python?

Using shuffle() method from Random library to shuffle the given array. Here we are using shuffle method from the built-in random module to shuffle the entire array at once.


1 Answers

Let's do some proper benchmarking. Here's some code, with your shuffle renamed to shuffle1, and my personal favorite variant thrown in as shuffle2.

import System.Random

import Control.Monad

import Control.Monad.ST.Strict
import Data.STRef.Strict

import Data.Vector.Mutable

import Prelude as P

import Criterion.Main


shuffle1 :: RandomGen g => [a] -> g -> ([a], g)
shuffle1 [] g0 = ([],g0)
shuffle1 [x] g0 = ([x],g0)
shuffle1 xs g0 = (x:newtail,g2)
  where (i,g1) = randomR (0, P.length $ P.tail xs) g0
        (xs1,x:xs2) = P.splitAt i xs
        (newtail,g2) = shuffle1 (xs1++xs2) g1


shuffle2 :: RandomGen g => [a] -> g -> ([a], g)
shuffle2 xs g0 = runST $ do
    let l = P.length xs
    v <- new l
    sequence_ $ zipWith (unsafeWrite v) [0..] xs

    let loop g i | i <= 1 = return g
                 | otherwise = do
            let i' = i - 1
                (j, g') = randomR (0, i') g
            unsafeSwap v i' j
            loop g' i'

    gFinal <- loop g0 l
    shuffled <- mapM (unsafeRead v) [0 .. l - 1]
    return (shuffled, gFinal)


main = do
    let s1 x = fst $ shuffle1 x g0
        s2 x = fst $ shuffle2 x g0
        arr = [0..1000] :: [Int]
        g0 = mkStdGen 0
    -- make sure these values are evaluated before the benchmark starts
    print (g0, arr)

    defaultMain [bench "shuffle1" $ nf s1 arr, bench "shuffle2" $ nf s2 arr]

And so, let's see some results:

carl@ubuntu:~/hask$ ghc -O2 shuffle.hs
[1 of 1] Compiling Main             ( shuffle.hs, shuffle.o )
Linking shuffle ...
carl@ubuntu:~/hask$ ./shuffle 
(1 1,[0, .. <redacted for brevity>])
warming up
estimating clock resolution...
mean is 5.762060 us (160001 iterations)
found 4887 outliers among 159999 samples (3.1%)
  4751 (3.0%) high severe
estimating cost of a clock call...
mean is 42.13314 ns (43 iterations)

benchmarking shuffle1
mean: 10.95922 ms, lb 10.92317 ms, ub 10.99903 ms, ci 0.950
std dev: 193.8795 us, lb 168.6842 us, ub 244.6648 us, ci 0.950
found 1 outliers among 100 samples (1.0%)
variance introduced by outliers: 10.396%
variance is moderately inflated by outliers

benchmarking shuffle2
mean: 256.9394 us, lb 255.5414 us, ub 258.7409 us, ci 0.950
std dev: 8.042766 us, lb 6.460785 us, ub 12.28447 us, ci 0.950
found 1 outliers among 100 samples (1.0%)
  1 (1.0%) high severe
variance introduced by outliers: 26.750%
variance is moderately inflated by outliers

Ok, my system is really noisy, and shouldn't be used for serious benchmarking of things with similar numbers. But that hardly matters here. shuffle2 is approximately 40x faster than shuffle1 on a list with 1001 elements. Due to the differences between O(n) and O(n^2), that will only increase in with larger lists. I'm certain that whatever your test code was timing, it wasn't the shuffle algorithm.

Actually, I have a guess. Your version is lazy enough to return results incrementally. 5 seconds is a plausible period of time for getting the first few results, if you never touch the generator after the call to it. Maybe that's what's going on in your timing.

like image 115
Carl Avatar answered Sep 29 '22 11:09

Carl