
Create large random boolean matrix with numpy

I am trying to create a huge boolean matrix which is randomly filled with True and False with a given probability p. At first I used this code:

N = 30000
p = 0.1
np.random.choice(a=[False, True], size=(N, N), p=[p, 1-p])

But sadly it does not seem to terminate for this big N. So I tried to split it up into the generation of the single rows by doing this:

N = 30000
p = 0.1
mask = np.empty((N, N))
for i in range(N):
    mask[i] = np.random.choice(a=[False, True], size=N, p=[p, 1-p])
    if (i % 100 == 0):
        print(i)

Now something strange happens (at least on my device): the first ~1100 rows are generated very quickly, but after that the code becomes horribly slow. Why is this happening? What am I missing here? Are there better ways to create a big matrix which has True entries with probability p and False entries with probability 1-p?

Edit: Many of you assumed that RAM would be a problem. The device that will run the code has almost 500GB of RAM, so this won't be an issue.

zimmerrol asked Apr 20 '17 19:04




1 Answer

The problem is your RAM: the values are stored in memory while the matrix is being created. I just created this matrix using this command:

np.random.choice(a=[False, True], size=(N, N), p=[p, 1-p])

I used an AWS i3 instance with 64GB of RAM and 8 cores. To create this matrix, htop shows that it takes up ~20GB of RAM. Here is a benchmark in case you care:

time np.random.choice(a=[False, True], size=(N, N), p=[p, 1-p])

CPU times: user 18.3 s, sys: 3.4 s, total: 21.7 s
Wall time: 21.7 s

def mask_method(N, p):
    for i in range(N):
        mask[i] = np.random.choice(a=[False, True], size=N, p=[p, 1-p])
        if (i % 100 == 0):
            print(i)

time mask_method(N,p)

CPU times: user 20.9 s, sys: 1.55 s, total: 22.5 s
Wall time: 22.5 s

Note that the mask method only takes up ~9GB of RAM at its peak.
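As a rough sanity check on these figures, the size of a dense N x N array can be estimated from its dtype. This is only a back-of-the-envelope sketch: it assumes `mask` was allocated with `np.empty((N, N))` as in the question, which defaults to float64, and the helper name `matrix_bytes` is made up for illustration.

```python
import numpy as np

N = 30000  # matrix side length from the question


def matrix_bytes(n, dtype):
    """Bytes needed to hold a dense n x n array of the given dtype."""
    return n * n * np.dtype(dtype).itemsize


# Compare a boolean matrix with the float64 default of np.empty.
for name in ("bool", "float64"):
    print(f"{name:8s} {matrix_bytes(N, name) / 1e9:.1f} GB")
```

For N = 30000 this works out to about 0.9 GB for bool and 7.2 GB for float64, so the float64 `mask` alone would explain most of the ~9GB. The higher peak of the one-shot call is presumably due to temporary arrays created inside `np.random.choice`, but that is an assumption and was not measured here.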

Edit: The first method frees the RAM once the call is done, whereas the function method retains all of it.
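If peak memory matters, one possible alternative (not what was benchmarked above) is to preallocate a boolean array and fill it in blocks of rows by comparing uniform random numbers against p. This is a minimal sketch assuming NumPy 1.17 or newer for `np.random.default_rng`; the function name `random_bool_matrix` and the `chunk_rows` parameter are hypothetical. Note that it makes entries True with probability p, as the question text asks. The posted `np.random.choice` call pairs `p` with `False`, so to reproduce that behaviour you would pass `1 - p` instead.

```python
import numpy as np


def random_bool_matrix(n, p, chunk_rows=1000, seed=None):
    """Return an n x n bool array whose entries are True with probability p."""
    rng = np.random.default_rng(seed)
    out = np.empty((n, n), dtype=bool)  # 1 byte per entry
    for start in range(0, n, chunk_rows):
        stop = min(start + chunk_rows, n)
        # Uniform floats in [0, 1); each one is < p with probability p.
        # Only this block of floats exists as a temporary at any time.
        out[start:stop] = rng.random((stop - start, n)) < p
    return out


# Small demonstration; N = 30000 works the same way but needs ~0.9 GB.
mask = random_bool_matrix(2000, 0.1, chunk_rows=256, seed=0)
print(mask.dtype, mask.shape, mask.mean())
```

With this layout the temporary float array is limited to `chunk_rows * n` values, so the peak should stay close to the size of the boolean result itself.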

gold_cy answered Sep 19 '22 21:09