 

How to generate 1 million random integers and write them to a file?

I was trying to run some tests on my external sorting algorithms, so I wanted to generate a large quantity of random numbers and write them to a file.

Here is how I do it:

import tempfile, random

nf = tempfile.NamedTemporaryFile(delete=False)
i = 0
while i < 1000:        # 1000 buffers...
    j = 0
    buf = ''
    while j < 1000:    # ...of 1000 numbers each: 1,000,000 numbers in total
        buf += str(random.randint(0, 1000))
        j += 1
    nf.write(buf)      # one write() per 1000 numbers
    i += 1
nf.close()             # flush the last buffer to disk

I thought I could speed up the process by reducing the number of file I/O operations, so I use buf to accumulate as many numbers as possible and then write buf to the file in a single call.

Question:

The generating and writing process still seems slow to me.

Am I doing something wrong?

EDIT:

In C++, we can simply write an int or a float to a file with <<, without explicitly converting it to a string.

Can we do the same in Python, i.e. write an integer to a file without converting it to str?

asked Dec 16 '22 by Alcott

1 Answer

Operating systems are already optimized for this kind of I/O, so you can write the numbers directly to the file and still get very good speed:

import tempfile, random

with tempfile.NamedTemporaryFile(delete=False) as nf:
    for _ in xrange(1000000):  # xrange() is more efficient than range(), in Python 2
        nf.write(str(random.randint(0, 1000)))

In practice, the numbers are only written to disk when the file object's internal buffer fills up, so the write calls are batched for you. The code in the question and the code above take the same time on my machine, so I would advise using the simpler version and relying on the operating system's built-in optimizations. (Note that, as in your original code, the numbers are written back-to-back with no separator; if your sorting code needs to read them back individually, append '\n' to each one.)
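If you want to see how much the buffer matters, you can experiment with its size: in Python 2, NamedTemporaryFile() accepts a bufsize argument, which is passed to the underlying file object (the default of -1 lets the library choose). A minimal sketch, with the 512 KiB value chosen arbitrarily for illustration:

import tempfile, random

# bufsize controls the file object's internal buffer (Python 2 only);
# 512 KiB here is an arbitrary value to experiment with
with tempfile.NamedTemporaryFile(delete=False, bufsize=512 * 1024) as nf:
    for _ in xrange(1000000):
        nf.write(str(random.randint(0, 1000)))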

If the result fits in memory (which is the case for 1 million numbers), then you can indeed save some I/O operations by creating the final string and then writing it in one go:

with tempfile.NamedTemporaryFile(delete=False) as nf:
    nf.write(''.join(str(random.randint(0, 1000)) for _ in xrange(1000000)))

This second approach is about 30% faster on my computer (2.6 s instead of 3.8 s), probably thanks to the single call to write() (instead of a million calls, and therefore far fewer buffer flushes and actual disk writes).
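If you want to reproduce this comparison on your own machine, a rough harness along the following lines will do (a sketch, not my exact measurement setup; absolute times will of course vary):

import random, tempfile, time

def one_write_per_number(nf):
    # a million small write() calls
    for _ in xrange(1000000):
        nf.write(str(random.randint(0, 1000)))

def single_write(nf):
    # build the whole string in memory, then one write() call
    nf.write(''.join(str(random.randint(0, 1000)) for _ in xrange(1000000)))

for label, strategy in [('one write() per number', one_write_per_number),
                        ('single write()', single_write)]:
    with tempfile.NamedTemporaryFile(delete=False) as nf:
        start = time.time()
        strategy(nf)
        print '%s: %.1f s' % (label, time.time() - start)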

The "many big writes" approach of your question falls in the middle (3.1 s). It can be improved, though: it is clearer and more Pythonic to write it like this:

import tempfile, random

with tempfile.NamedTemporaryFile(delete=False) as nf:
    for _ in xrange(1000):
        nf.write(''.join(str(random.randint(0, 1000)) for _ in xrange(1000)))

This solution is equivalent to, but faster than, the code in the original question (2.6 s on my machine, instead of 3.8 s).

In summary, the first, simple approach above might be fast enough for you. If it is not, and if the whole file fits in memory, the second approach is both very fast and simple. Otherwise, your initial idea (fewer writes, bigger blocks) is sound: it is about as fast as the single-write approach and still quite simple when written as above.
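As for the edit in the question: C++'s << also converts numbers to their text representation, it just does so implicitly. If what you actually want is raw binary output (like writing the bytes of an int with C++'s ostream::write()), Python's struct module can do that. A minimal sketch, packing each number as a 4-byte little-endian signed int:

import random, struct, tempfile

with tempfile.NamedTemporaryFile(delete=False) as nf:
    # '<i' = little-endian 4-byte signed int: no decimal string is ever built
    nf.write(''.join(struct.pack('<i', random.randint(0, 1000))
                     for _ in xrange(1000000)))

The resulting file is fixed-width binary: it is not human-readable, and reading it back means unpacking 4 bytes at a time with struct.unpack().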

answered Jan 04 '23 by Eric O Lebigot