Efficient use of numpy.random.choice with repeated numbers and alternatives

Question

I need to generate a large array with repeated elements, and my code is:

np.repeat(xrange(x,y), data)

However, data is a NumPy array of dtype float64 (the values are all whole numbers; there is no 2.1 in there), and I get the error

TypeError: Cannot cast array data from dtype('float64') to dtype('int64') according to the rule 'safe'

Example:

In [35]: x
Out[35]: 26

In [36]: y
Out[36]: 50

In [37]: data
Out[37]: 
array([ 3269.,   106.,  5533.,   317.,  1512.,   208.,   502.,   919.,
     406.,   421.,  1690.,  2236.,   705.,   505.,   230.,   213.,
     307.,  1628.,  4389.,  1491.,   355.,   103.,   854.,   424.])
In [38]: np.repeat(xrange(x,y), data)

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-38-105860821359> in <module>()
----> 1 np.repeat(xrange(x,y), data)

/home/pcadmin/anaconda2/lib/python2.7/site-packages/numpy/core/fromnumeric.pyc in repeat(a, repeats, axis)
394         repeat = a.repeat
395     except AttributeError:
--> 396         return _wrapit(a, 'repeat', repeats, axis)
397     return repeat(repeats, axis)
398 

/home/pcadmin/anaconda2/lib/python2.7/site-packages/numpy/core/fromnumeric.pyc in _wrapit(obj, method, *args, **kwds)
 46     except AttributeError:
 47         wrap = None
---> 48     result = getattr(asarray(obj), method)(*args, **kwds)
 49     if wrap:
 50         if not isinstance(result, mu.ndarray):

TypeError: Cannot cast array data from dtype('float64') to dtype('int64') according to the rule 'safe'

I solved it by changing the code to

np.repeat(xrange(x,y), data.astype('int64'))

However, this is now one of the most expensive lines in my code. Is there an alternative?
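For anyone who wants to reproduce the cast problem on a small scale, here is a minimal sketch. It assumes Python 3, so np.arange stands in for xrange, and the tiny x, y and data values are made up for illustration. The check before the cast is an optional safeguard, not part of the original code.

```python
import numpy as np

# Made-up small inputs; the real data array is much larger.
x, y = 26, 30
data = np.array([3., 1., 0., 2.])  # float64, but every value is a whole number

# np.repeat casts the repeat counts under the 'safe' rule, so float counts
# are expected to be rejected with the TypeError shown above.
try:
    np.repeat(np.arange(x, y), data)
except TypeError as exc:
    print("float counts rejected:", exc)

# Cast once, after confirming that nothing fractional would be lost.
counts = data.astype(np.int64)
if not np.array_equal(counts, data):
    raise ValueError("data contains non-integer counts")

pool = np.repeat(np.arange(x, y), counts)
print(pool)
```

Passing an integer array for the values (np.arange rather than a range object) also avoids the extra conversion that np.repeat would otherwise do on each call.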

By the way, I am using this inside

np.random.choice(np.repeat(xrange(x,y), data.astype('int64')), z)

in order to draw a sample of size z, without replacement, from the integers between x and y, where data gives the number of copies of each integer. Is this the best approach for that as well?
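One detail worth checking here: np.random.choice samples with replacement unless replace=False is passed, so the call above does not by itself give a sample without replacement. A minimal sketch of the difference, again with made-up small inputs and np.arange standing in for xrange:

```python
import numpy as np

# Made-up small inputs for illustration.
x, y, z = 26, 30, 4
data = np.array([3., 1., 0., 2.])

# One entry in the pool per available copy of each integer.
pool = np.repeat(np.arange(x, y), data.astype(np.int64))

# Default behaviour: with replacement, so a value can be drawn more often
# than its count in data allows.
with_replacement = np.random.choice(pool, z)

# Without replacement: each entry of the pool is used at most once.
without_replacement = np.random.choice(pool, z, replace=False)

# Count how often each integer was drawn, aligned with data.
drawn = np.bincount(without_replacement - x, minlength=y - x)
print(without_replacement, drawn)
```

With replace=False, the per-integer counts in drawn can never exceed the corresponding entries of data.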

Warren Weckesser · Accepted Answer

Lurking in the question is the multivariate hypergeometric distribution. In my answer to "Numpy drawing from urn", I implemented a function that draws samples from this distribution. I suspect it is very similar to the solution @DiogoSantos described in an answer. Diogo says that using this approach is slow, but I find the following to be faster than Divakar's optimized_v1.

Here is a function that uses sample(n, colors) from the linked answer and has the same signature as Divakar's functions.

def hypergeom_version(x, y, z, data):
    s = sample(z, data)
    result = np.repeat(np.arange(x, y), s)
    return result

(This returns the values in sorted order. If you need the values to be in random order, add np.random.shuffle(result) before the return statement. It does not change the execution time significantly.)
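The body of sample(n, colors) is not reproduced on this page. As a rough idea of what such a function could look like, the sketch below draws the count for each color in turn from an ordinary hypergeometric distribution, treating that color as "good" and all later colors as "bad". This is an illustrative stand-in under that assumption, not the code from the linked answer; the rng parameter and the small example values are hypothetical additions, and it assumes at least one color.

```python
import numpy as np

def sample(n, colors, rng=None):
    """Return how many items of each color appear in a draw of n items
    without replacement from an urn holding colors[k] items of color k.

    Illustrative stand-in only; not the code from the linked answer.
    """
    rng = np.random.default_rng() if rng is None else rng
    colors = np.asarray(colors).astype(np.int64)  # accept float-valued counts
    if n > colors.sum():
        raise ValueError("cannot draw more items than the urn holds")
    # remaining[i] = number of items of color i or any later color
    remaining = np.cumsum(colors[::-1])[::-1]
    result = np.zeros(len(colors), dtype=np.int64)
    for i in range(len(colors) - 1):
        if n < 1:
            break  # everything has been drawn already
        # Color i is "good", all later colors together are "bad".
        result[i] = rng.hypergeometric(colors[i], remaining[i + 1], n)
        n -= result[i]
    result[-1] = n  # whatever is left must come from the last color
    return result

# Usage with made-up values: 5 draws from an urn of 3 + 1 + 0 + 2 items.
counts = sample(5, [3., 1., 0., 2.])
values = np.repeat(np.arange(26, 30), counts)
print(counts, values)
```

The loop runs once per color rather than once per item, so the cost depends on the number of distinct integers and not on how large the values in data are.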

Comparison:

In [153]: x = 100

In [154]: y = 100100

In [155]: z = 10000

In [156]: data = np.random.randint(1, 125, (y-x)).astype(float)

Divakar's optimized_v1:

In [157]: %timeit optimized_v1(x, y, z, data)
1 loop, best of 3: 520 ms per loop

hypergeom_version:

In [158]: %timeit hypergeom_version(x, y, z, data)
1 loop, best of 3: 244 ms per loop

If the values in data are larger, the relative performance is even better:

In [164]: data = np.random.randint(100, 500, (y-x)).astype(float)

In [165]: %timeit optimized_v1(x, y, z, data)
1 loop, best of 3: 2.91 s per loop

In [166]: %timeit hypergeom_version(x, y, z, data)
1 loop, best of 3: 246 ms per loop
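On NumPy 1.18 or later, the Generator class provides multivariate_hypergeometric(colors, nsample), so a hand-written sampler may not be needed. The sketch below is a variant with the same signature as hypergeom_version; the name hypergeom_builtin, the rng parameter and the example values are hypothetical, and it has not been timed against the numbers above.

```python
import numpy as np

def hypergeom_builtin(x, y, z, data, rng=None):
    """Draw z integers from [x, y) without replacement, where data[k]
    is the number of available copies of the integer x + k."""
    rng = np.random.default_rng() if rng is None else rng
    colors = np.asarray(data).astype(np.int64)  # the method needs integer counts
    # counts[k] = how many copies of x + k ended up in the sample
    counts = rng.multivariate_hypergeometric(colors, z)
    result = np.repeat(np.arange(x, y), counts)
    rng.shuffle(result)  # optional: np.repeat returns the values sorted
    return result

# Usage with made-up values.
x, y, z = 100, 105, 10
data = np.array([5., 3., 8., 1., 10.])
result = hypergeom_builtin(x, y, z, data)
print(result)
```

Only the z sampled values are ever materialized, so the large repeated pool from the question is never built.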

Tags: python, casting, numpy, python-2.7, repeat

Diogo Santos