I need to generate a large array with repeated elements, and my code is:
np.repeat(xrange(x,y), data)
However, data is a NumPy array of dtype float64 (although it only holds integer values, nothing like 2.1), and I get the error
TypeError: Cannot cast array data from dtype('float64') to dtype('int64') according to the rule 'safe'
Example:
In [35]: x
Out[35]: 26
In [36]: y
Out[36]: 50
In [37]: data
Out[37]:
array([ 3269., 106., 5533., 317., 1512., 208., 502., 919.,
406., 421., 1690., 2236., 705., 505., 230., 213.,
307., 1628., 4389., 1491., 355., 103., 854., 424.])
In [38]: np.repeat(xrange(x,y), data)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-38-105860821359> in <module>()
----> 1 np.repeat(xrange(x,y), data)
/home/pcadmin/anaconda2/lib/python2.7/site-packages/numpy/core/fromnumeric.pyc in repeat(a, repeats, axis)
394 repeat = a.repeat
395 except AttributeError:
--> 396 return _wrapit(a, 'repeat', repeats, axis)
397 return repeat(repeats, axis)
398
/home/pcadmin/anaconda2/lib/python2.7/site-packages/numpy/core/fromnumeric.pyc in _wrapit(obj, method, *args, **kwds)
46 except AttributeError:
47 wrap = None
---> 48 result = getattr(asarray(obj), method)(*args, **kwds)
49 if wrap:
50 if not isinstance(result, mu.ndarray):
TypeError: Cannot cast array data from dtype('float64') to dtype('int64') according to the rule 'safe'
I solved it by changing the code to
np.repeat(xrange(x,y), data.astype('int64'))
However, this is now one of the most expensive lines in my code! Is there an alternative?
By the way, I am using this inside
np.random.choice(np.repeat(xrange(x,y), data.astype('int64')), z)
in order to get a sample without replacement of size z from the integers between x and y, with the number of copies of each integer given in data. I guess this is the best approach for that too, right?
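For reference, a minimal self-contained sketch of the approach described above, written with np.arange so it also runs on Python 3. The value of z is a hypothetical choice (the post does not give one). Note that np.random.choice samples with replacement unless replace=False is passed, so the flag is set explicitly here to match the stated goal.

```python
import numpy as np

x, y = 26, 50
z = 1000  # hypothetical sample size, not given in the post

# Counts stored as float64, exactly as in the example above.
data = np.array([3269., 106., 5533., 317., 1512., 208., 502., 919.,
                 406., 421., 1690., 2236., 705., 505., 230., 213.,
                 307., 1628., 4389., 1491., 355., 103., 854., 424.])

# np.repeat refuses float repeat counts under the 'safe' casting rule,
# so cast explicitly to an integer dtype first.
counts = data.astype(np.int64)

# Build the full pool: each integer in [x, y) repeated counts[i] times.
pool = np.repeat(np.arange(x, y), counts)

# Draw z items from the pool without replacement.
draw = np.random.choice(pool, z, replace=False)
```

The pool holds one entry per item, so its size (and the cost of building and sampling it) grows with the sum of data.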
Lurking in the question is the multivariate hypergeometric distribution. In "Numpy drawing from urn", I implemented a function that draws samples from this distribution. I suspect it is very similar to the solution @DiogoSantos described in an answer. Diogo says that this approach is slow, but I find the following to be faster than Divakar's optimized_v1.
Here is a function that uses sample(n, colors) from the linked answer to implement a function with the same signature as Divakar's functions.
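def hypergeom_version(x, y, z, data):
    # sample(n, colors) is the function from the linked answer.
    s = sample(z, data)
    # Repeat each integer in [x, y) as many times as it was drawn.
    result = np.repeat(np.arange(x, y), s)
    return result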
(This returns the values in sorted order. If you need the values in random order, add np.random.shuffle(result) before the return statement. It does not change the execution time significantly.)
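The sample(n, colors) function itself lives in the linked answer and is not reproduced here. As a rough sketch of the idea, assuming it returns the number of items drawn for each color, a multivariate hypergeometric draw can be built from a chain of univariate hypergeometric draws. This is an illustration under that assumption, not necessarily the linked implementation; the rng parameter is an addition for reproducibility.

```python
import numpy as np

def sample(n, colors, rng=None):
    """Stand-in sketch: draw n items without replacement from an urn that
    holds colors[i] items of color i, and return the count drawn per color."""
    rng = np.random.default_rng() if rng is None else rng
    # Counts may arrive as floats (as in the question); cast them once.
    colors = np.asarray(colors).astype(np.int64)
    n = int(n)
    total = int(colors.sum())
    if n > total:
        raise ValueError("cannot draw more items than the urn contains")
    # remaining[i] = number of items whose color index is greater than i.
    remaining = total - np.cumsum(colors)
    counts = np.zeros(len(colors), dtype=np.int64)
    for i in range(len(colors) - 1):
        if n == 0:
            break
        # How many of the n draws have color i, versus any later color.
        counts[i] = rng.hypergeometric(colors[i], remaining[i], n)
        n -= int(counts[i])
    # Whatever is left must come from the last color.
    counts[-1] = n
    return counts

# Usage: expand the per-color counts back into the drawn values.
x, y, z = 26, 30, 6
data = np.array([3., 0., 5., 2.])
drawn = np.repeat(np.arange(x, y), sample(z, data))
```

The loop runs once per distinct value rather than once per item, so the full repeated array is never built. NumPy 1.18 and later also provide Generator.multivariate_hypergeometric(colors, nsample), which should make the hand-written loop unnecessary.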
Comparison:
In [153]: x = 100
In [154]: y = 100100
In [155]: z = 10000
In [156]: data = np.random.randint(1, 125, (y-x)).astype(float)
Divakar's optimized_v1:
In [157]: %timeit optimized_v1(x, y, z, data)
1 loop, best of 3: 520 ms per loop
hypergeom_version:
In [158]: %timeit hypergeom_version(x, y, z, data)
1 loop, best of 3: 244 ms per loop
If the values in data are larger, the relative performance is even better:
In [164]: data = np.random.randint(100, 500, (y-x)).astype(float)
In [165]: %timeit optimized_v1(x, y, z, data)
1 loop, best of 3: 2.91 s per loop
In [166]: %timeit hypergeom_version(x, y, z, data)
1 loop, best of 3: 246 ms per loop