I have a large 10,000,000+ length array that contains rows. I need to individually shuffle those rows. For example:
[[1,2,3]
[1,2,3]
[1,2,3]
...
[1,2,3]]
to
[[3,1,2]
[2,1,3]
[1,3,2]
...
[1,2,3]]
I'm currently using

map(numpy.random.shuffle, array)

but that's a Python (not NumPy) loop and it's taking 99% of my execution time. Sadly, the PyPy JIT doesn't implement numpypy.random, so I'm out of luck. Is there any faster way? I'm willing to use any library (pandas, scikit-learn, scipy, theano, etc.) as long as it uses a NumPy ndarray or a derivative.
If not, I suppose I'll resort to Cython or C++.
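One generic, loop-free approach worth trying before dropping to Cython: sort each row by independent random keys. This is a sketch, not your code; it assumes NumPy >= 1.15 for np.take_along_axis, and the function name shuffle_rows is illustrative.

```python
import numpy as np

def shuffle_rows(a):
    # Argsort of independent uniform keys yields a uniformly random
    # permutation of each row, with no Python-level loop and no
    # dependence on the number of columns.
    keys = np.random.rand(*a.shape)
    return np.take_along_axis(a, keys.argsort(axis=1), axis=1)

a = np.tile(np.arange(1, 4), (1000, 1))
shuffled = shuffle_rows(a)  # each row is an independent permutation of [1, 2, 3]
```

Unlike the permutation-table tricks below, this costs an O(n log n) sort per row, but it works for any number of columns.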
Here are some ideas:
In [10]: a=np.zeros(shape=(1000,3))
In [12]: a[:,0]=1
In [13]: a[:,1]=2
In [14]: a[:,2]=3
In [17]: %timeit map(np.random.shuffle, a)
100 loops, best of 3: 4.65 ms per loop
In [21]: all_perm=np.array((list(itertools.permutations([0,1,2]))))
In [22]: b=all_perm[np.random.randint(0,6,size=1000)]
In [25]: %timeit (a.flatten()[(b+3*np.arange(1000)[...,np.newaxis]).flatten()]).reshape(a.shape)
1000 loops, best of 3: 393 us per loop
If there are only a few columns, then the number of possible permutations is much smaller than the number of rows in the array (with only 3 columns, there are just 3! = 6 possible permutations). A way to make it faster is to build all the permutations at once first, and then rearrange each row by randomly picking one permutation from the full set.
It is still about 10 times faster, even at larger sizes:
#adjust a accordingly
In [32]: b=all_perm[np.random.randint(0,6,size=1000000)]
In [33]: %timeit (a.flatten()[(b+3*np.arange(1000000)[...,np.newaxis]).flatten()]).reshape(a.shape)
1 loops, best of 3: 348 ms per loop
In [34]: %timeit map(np.random.shuffle, a)
1 loops, best of 3: 4.64 s per loop
If the permutations of the columns are enumerable, then you could do this:
import itertools as IT
import numpy as np
def using_perms(array):
    nrows, ncols = array.shape
    perms = np.array(list(IT.permutations(range(ncols))))
    choices = np.random.randint(len(perms), size=nrows)
    i = np.arange(nrows).reshape(-1, 1)
    return array[i, perms[choices]]

N = 10**7
array = np.tile(np.arange(1, 4), (N, 1))
print(using_perms(array))
yields (something like)
[[3 2 1]
[3 1 2]
[2 3 1]
[1 2 3]
[3 1 2]
...
[1 3 2]
[3 1 2]
[3 2 1]
[2 1 3]
[1 3 2]]
Here is a benchmark comparing it to
def using_shuffle(array):
    # list(...) forces the map, which is lazy under Python 3
    list(map(np.random.shuffle, array))
    return array
In [151]: %timeit using_shuffle(array)
1 loops, best of 3: 7.17 s per loop
In [152]: %timeit using_perms(array)
1 loops, best of 3: 2.78 s per loop
Edit: CT Zhu's method is faster than mine:
def using_Zhu(array):
    nrows, ncols = array.shape
    all_perm = np.array(list(IT.permutations(range(ncols))))
    b = all_perm[np.random.randint(0, all_perm.shape[0], size=nrows)]
    # the flat index of element (i, j) is ncols*i + j
    return (array.flatten()[(b + ncols*np.arange(nrows)[..., np.newaxis]).flatten()]
            ).reshape(array.shape)
In [177]: %timeit using_Zhu(array)
1 loops, best of 3: 1.7 s per loop
Here is a slight variation of Zhu's method which may be even a bit faster:
def using_Zhu2(array):
    nrows, ncols = array.shape
    all_perm = np.array(list(IT.permutations(range(ncols))))
    b = all_perm[np.random.randint(0, all_perm.shape[0], size=nrows)]
    return array.take((b + ncols*np.arange(nrows)[..., np.newaxis]).ravel()).reshape(array.shape)
In [201]: %timeit using_Zhu2(array)
1 loops, best of 3: 1.46 s per loop
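For what it's worth, newer NumPy versions (1.20+) make all of this unnecessary: the Generator API has a permuted method that shuffles along an axis independently per row, in a single vectorized call. A minimal sketch, assuming NumPy >= 1.20:

```python
import numpy as np

rng = np.random.default_rng()
a = np.tile(np.arange(1, 4), (1000, 1))
# axis=1 shuffles within each row, independently per row;
# a copy is returned and the input is left untouched.
shuffled = rng.permuted(a, axis=1)
```

Note that rng.shuffle(a, axis=1) is different: it applies the *same* permutation to every row, whereas permuted draws an independent permutation per row.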