Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Time and memory-efficient random sampling with python/numpy

I need to calculate the expected value of distance between 2 random uniformly distributed points within a unit square. This problem can be solved analytically and the answer is approximately 0.521405. What I'm trying to accomplish is to come up with this number using random number generator and Python with and without numpy. Using the below code

dist=list()
random.seed()
for each in range(100000000):
    x1=random.random()
    x2=random.random()
    y1=random.random()
    y2=random.random()
    dist.append(math.sqrt((x1-x2)**2+(y1-y2)**2))
print(round(sum(dist)/float(len(dist)),6))

It iterates 100 million times in 125 seconds on my machine, but only 4 decimal digits are correct. Now, with the numpy I created the following code

dist=list()
start_time = time.time()
np.random.seed()
for each in range(100000000):
    x = np.array((np.random.random(),np.random.random()))
    y = np.array((np.random.random(),np.random.random()))
    dist.append(np.linalg.norm(x-y))
print(round(np.mean(dist),6))

and it took 1111 seconds to iterate the same 100 million times!

As the result was correct only up to 4 decimal digits, I tried to increase the number of iterations to 1 billion using the former version, without numpy. I figured, that since each float is at most 64 bits(i'm using 64 bit Python) the list would take roughly 8 GB. However, the program used up 26GB of memory and errored out with an exception when the list had 790 million items

So I'm seeking your advice on the following:

  1. Is there a way to make use of various numpy optimizations and actually make it work faster?
  2. Is there a way to make the program more memory-efficient? I realize that the list is a way more complex data structure than I need for my purpose
  3. Am I correct assuming that in order to get 6 decimal digits right I will need the number of iterations close to 10^12? (Because the standard error of N measurements is decreasing as 1/sqrt(N))

Thanks in advance!

like image 536
Nick Slavsky Avatar asked Feb 21 '26 05:02

Nick Slavsky


2 Answers

To answer the first two sub-questions, here's an approach that generates all the random numbers in one go and then makes use of np.einsum to replace most of np.linalg.norm's work (squaring the differentiations and sum along the rows) , keeping the rest of the code as it is. The implementation would look something like this -

N = 100000 # Edit this to change number of random points
x,y = np.random.random((2,N,2))
diffs = x-y
out = round(np.mean(np.sqrt(np.einsum('ij,ij->i',diffs,diffs))),6)
like image 189
Divakar Avatar answered Feb 22 '26 19:02

Divakar


As for the memory issue, your assumption about 'float takes 64 bits' is not quite correct. Every float object (and it is, actually, boxed object in python) will take 24 bytes. You can check it using sys.getsizeof(0.0). Thus, your program should require 3 times more space than you estimated, which is approximately what you actually have experienced.

like image 32
Gadzhi Musaev Avatar answered Feb 22 '26 17:02

Gadzhi Musaev



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!