I need to fill a file with a large number of records, each identified by a number (test data). The IDs should be unique, and the order of the records should be random (or pseudo-random).
I tried this:
# coding: utf-8
import random
COUNT = 100000000
random.seed(0)
file_1 = open('file1', 'w')
for i in random.sample(xrange(COUNT), COUNT):
    file_1.write('ID{0},A{0}\n'.format(i))
file_1.close()
But it's eating all of my memory.
Is there a way to generate a big shuffled sequence of unique integers (consecutive would be nice, but is not required), using a generator rather than keeping the whole sequence in RAM?
If you have 100 million numbers like in the question, then this is actually manageable in-memory (it takes about 0.5 GB).
As DSM pointed out, this can be done with the standard modules in an efficient way:
>>> import array
>>> a = array.array('I', xrange(10**8)) # a.itemsize indicates 4 bytes per element => about 0.5 GB
>>> import random
>>> random.shuffle(a)
It is also possible to use the third-party NumPy package, which is the standard Python tool for managing arrays in an efficient way:
>>> import numpy
>>> ids = numpy.arange(100000000, dtype='uint32') # 32 bits is enough for numbers up to about 4 billion
>>> numpy.random.shuffle(ids)
(This is only useful if your program already uses NumPy, as the standard module approach is about as efficient.)
Both methods take about the same amount of time on my machine (maybe 1 minute for the shuffling), and the 0.5 GB they use is not too big for current computers.
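To tie this back to the question, the shuffled IDs can then be streamed to the file one record at a time. The following is only a minimal sketch, assuming Python 3 (in Python 2, replace `range` with `xrange`) and the record format from the question; the function name `write_shuffled_records` and its `seed` parameter are illustrative and not part of the original answer:

```python
import array
import random


def write_shuffled_records(path, count, seed=0):
    """Write `count` records with unique IDs 0..count-1 in pseudo-random order."""
    rng = random.Random(seed)  # seeded, so the order is reproducible
    # Compact storage (typically 4 bytes per ID) instead of a list of int objects.
    ids = array.array('I', range(count))
    rng.shuffle(ids)  # shuffles in place
    with open(path, 'w') as out:
        for i in ids:
            out.write('ID{0},A{0}\n'.format(i))
```

Calling `write_shuffled_records('file1', 10**8)` would correspond to the original script; only the compact array of IDs is held in memory, and the records themselves are written as they are produced.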
PS: With this many elements the shuffle cannot be truly random: the number of possible permutations (the factorial of 10**8) is vastly larger than the period of the random generator used (2**19937 - 1 for the Mersenne Twister), so most permutations can never be produced. In other words, there are fewer Python shuffles than possible shuffles!
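The size of that gap can be checked with a little arithmetic. This is a rough sketch, assuming the generator is the Mersenne Twister with period 2**19937 - 1 (the one behind Python's `random` module); the helper name `log2_factorial` is made up for the illustration:

```python
import math

MT_PERIOD_BITS = 19937  # Mersenne Twister period is 2**19937 - 1


def log2_factorial(n):
    """Return log2(n!), computed via lgamma to avoid building the huge integer."""
    return math.lgamma(n + 1) / math.log(2)


# Bits needed to index every permutation of 10**8 elements.
bits_needed = log2_factorial(10**8)
print(bits_needed > MT_PERIOD_BITS)
```

Indexing every permutation of 10**8 elements takes on the order of 2.5 * 10**9 bits, against fewer than 2 * 10**4 bits of generator period. For test data this usually does not matter, but it does mean the shuffle covers only a tiny fraction of all possible orderings.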
Maybe something like (won't be consecutive, but will be unique):
from uuid import uuid4
def unique_nums(): # Not strictly unique, but *practically* unique
    while True:
        yield int(uuid4().hex, 16)
        # alternative yield uuid4().int
unique_num = unique_nums()
next(unique_num)
next(unique_num) # etc...
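Applied to the original problem, a generator like this lets the file be written without holding any sequence in memory. A minimal sketch, assuming the record format from the question; `write_records` is a hypothetical helper, and note that `uuid4()` draws from the operating system's random source, so `random.seed(0)` would not make the output reproducible:

```python
from itertools import islice
from uuid import uuid4


def unique_nums():
    """Yield 128-bit integers (122 random bits); collisions are possible but extremely unlikely."""
    while True:
        yield uuid4().int


def write_records(path, count):
    """Write `count` records, holding only one ID in memory at a time."""
    with open(path, 'w') as out:
        for n in islice(unique_nums(), count):
            out.write('ID{0},A{0}\n'.format(n))
```

The trade-off is that the IDs are very large numbers rather than consecutive ones, and uniqueness is probabilistic rather than guaranteed.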