Efficient use of numpy.random.choice with repeated numbers and alternatives

I need to generate a large array with repeated elements, and my code is:

np.repeat(xrange(x,y), data)

However, data is a numpy array of type float64 (although every value is a whole number; there is no 2.1 in there), and I get the error

TypeError: Cannot cast array data from dtype('float64') to dtype('int64') according to the rule 'safe'

Example:

In [35]: x
Out[35]: 26

In [36]: y
Out[36]: 50

In [37]: data
Out[37]: 
array([ 3269.,   106.,  5533.,   317.,  1512.,   208.,   502.,   919.,
     406.,   421.,  1690.,  2236.,   705.,   505.,   230.,   213.,
     307.,  1628.,  4389.,  1491.,   355.,   103.,   854.,   424.])
In [38]: np.repeat(xrange(x,y), data)

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-38-105860821359> in <module>()
----> 1 np.repeat(xrange(x,y), data)

/home/pcadmin/anaconda2/lib/python2.7/site-packages/numpy/core/fromnumeric.pyc in repeat(a, repeats, axis)
394         repeat = a.repeat
395     except AttributeError:
--> 396         return _wrapit(a, 'repeat', repeats, axis)
397     return repeat(repeats, axis)
398 

/home/pcadmin/anaconda2/lib/python2.7/site-packages/numpy/core/fromnumeric.pyc in _wrapit(obj, method, *args, **kwds)
 46     except AttributeError:
 47         wrap = None
---> 48     result = getattr(asarray(obj), method)(*args, **kwds)
 49     if wrap:
 50         if not isinstance(result, mu.ndarray):

TypeError: Cannot cast array data from dtype('float64') to dtype('int64') according to the rule 'safe'

I solved it by changing the code to

np.repeat(xrange(x,y), data.astype('int64'))

However, this is now one of the most expensive lines in my code. Is there a faster alternative?
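For reference, here is a minimal sketch of the cast-then-repeat step. The traceback above shows np.repeat going through _wrapit and asarray because xrange has no repeat method, so passing an ndarray built with np.arange may avoid that conversion. Whether this helps noticeably is an assumption that should be timed on real data. The names values, counts and population are illustrative only.

```python
import numpy as np

x, y = 26, 50
# Float-valued counts that are all whole numbers, as in the question.
data = np.array([3269., 106., 5533., 317., 1512., 208., 502., 919.,
                 406., 421., 1690., 2236., 705., 505., 230., 213.,
                 307., 1628., 4389., 1491., 355., 103., 854., 424.])

# Guard against silently truncating a count that is not a whole number.
assert np.all(data == np.floor(data))

values = np.arange(x, y)        # already an ndarray, no xrange conversion
counts = data.astype(np.int64)  # explicit cast; do it once, outside any loop

# Each values[i] is repeated counts[i] times.
population = np.repeat(values, counts)
```

If data does not change between calls, the cast (and possibly the repeated array itself) can be computed once and reused.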

By the way, I am using this inside

np.random.choice(np.repeat(xrange(x,y), data.astype('int64')), z)

in order to draw a sample of size z, without replacement, from the integers between x and y, where data gives how many copies of each integer are available. Is this also the best approach for that?
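One detail worth checking here: np.random.choice samples with replacement by default (replace=True), so the call above needs replace=False to draw without replacement. A small sketch follows, with made-up counts (1 to 24) chosen only so that the constraint is easy to verify; they are not the data from the question.

```python
import numpy as np

x, y, z = 26, 50, 250

# Illustrative counts only: 1 copy of 26, 2 copies of 27, ..., 24 copies of 49.
counts = np.arange(1, y - x + 1)
population = np.repeat(np.arange(x, y), counts)

# replace=False makes this a draw without replacement.
drawn = np.random.choice(population, z, replace=False)

# How many times each integer was drawn; never more than is available.
drawn_counts = np.bincount(drawn - x, minlength=y - x)
```

With replace=False, no integer can appear in drawn more often than its count allows, which is the property the question asks for.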

Diogo Santos asked Jan 06 '23 10:01


1 Answer

Lurking in the question is the multivariate hypergeometric distribution. In my answer to "Numpy drawing from urn", I implemented a function that draws samples from this distribution. I suspect it is very similar to the solution @DiogoSantos described in an answer. Diogo says that this approach is slow, but I find the following to be faster than Divakar's optimized_v1.
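As a side note, newer NumPy versions (1.18 and later) provide this distribution directly as Generator.multivariate_hypergeometric. The sketch below is not the function from the linked answer; it only illustrates the same idea with the built-in, using the data from the question. The seed and the names colors and picked are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(12345)  # arbitrary seed, for repeatability

x, y, z = 26, 50, 1000
data = np.array([3269., 106., 5533., 317., 1512., 208., 502., 919.,
                 406., 421., 1690., 2236., 705., 505., 230., 213.,
                 307., 1628., 4389., 1491., 355., 103., 854., 424.])
colors = data.astype(np.int64)  # the counts must be integers

# picked[i] is how many copies of the integer x + i end up in the sample.
picked = rng.multivariate_hypergeometric(colors, z)

# Expand the counts into the sampled values, then put them in random order.
result = np.repeat(np.arange(x, y), picked)
rng.shuffle(result)
```

This never builds the full repeated array; only the z sampled values are materialized.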

Here is a function that uses sample(n, colors) from the linked answer and has the same signature as Divakar's functions.

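def hypergeom_version(x, y, z, data):
    # sample(n, colors) comes from the linked answer; s holds how many
    # of each value were drawn, without replacement.
    s = sample(z, data)
    # Expand the per-value counts into the sampled values themselves.
    result = np.repeat(np.arange(x, y), s)
    return result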

(This returns the values in sorted order. If you need the values to be in random order, add np.random.shuffle(result) before the return statement. It does not change the execution time significantly.)
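For readers without the linked answer at hand, here is a self-contained sketch. The sample function below is a hypothetical stand-in, not the code from the linked answer: it assumes the usual technique of drawing the count for each value in turn with np.random.hypergeometric. The shuffle keyword is also an addition made for illustration.

```python
import numpy as np

def sample(n, colors):
    """Hypothetical stand-in: draw n items without replacement from an urn
    holding colors[i] items of kind i; return the count drawn of each kind."""
    colors = np.asarray(colors).astype(np.int64)
    if n > colors.sum():
        raise ValueError("cannot draw more items than the urn contains")
    m = len(colors)
    # remaining[i] = total number of items of kinds i, i+1, ..., m-1
    remaining = np.cumsum(colors[::-1])[::-1]
    result = np.zeros(m, dtype=np.int64)
    for i in range(m - 1):
        if n < 1:
            break
        # "Good" items are kind i, "bad" items are all later kinds.
        result[i] = np.random.hypergeometric(colors[i], remaining[i + 1], n)
        n -= result[i]
    # Whatever is left to draw must come from the last kind.
    result[m - 1] = n
    return result

def hypergeom_version(x, y, z, data, shuffle=False):
    s = sample(z, data)
    result = np.repeat(np.arange(x, y), s)
    if shuffle:
        np.random.shuffle(result)  # in-place; values are sorted otherwise
    return result

# Small usage example with made-up counts.
x, y, z = 100, 1100, 500
data = np.random.randint(1, 125, (y - x)).astype(float)
out = hypergeom_version(x, y, z, data, shuffle=True)
```

The Python loop runs once per distinct value, so its cost grows with y - x rather than with the total number of items in data.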

Comparison:

In [153]: x = 100

In [154]: y = 100100

In [155]: z = 10000

In [156]: data = np.random.randint(1, 125, (y-x)).astype(float)

Divakar's optimized_v1:

In [157]: %timeit optimized_v1(x, y, z, data)
1 loop, best of 3: 520 ms per loop

hypergeom_version:

In [158]: %timeit hypergeom_version(x, y, z, data)
1 loop, best of 3: 244 ms per loop

If the values in data are larger, the relative performance is even better:

In [164]: data = np.random.randint(100, 500, (y-x)).astype(float)

In [165]: %timeit optimized_v1(x, y, z, data)
1 loop, best of 3: 2.91 s per loop

In [166]: %timeit hypergeom_version(x, y, z, data)
1 loop, best of 3: 246 ms per loop

Warren Weckesser answered Jan 10 '23 21:01