
Generate big random sequence of unique numbers [duplicate]

Tags: python, random

I need to fill a file with a lot of records identified by a number (test data). The number of records is very big, the ids should be unique, and the order of the records should be random (or pseudo-random).

I tried this:

# coding: utf-8
import random

COUNT = 100000000

random.seed(0)
file_1 = open('file1', 'w')
# random.sample() copies the population into a list and builds the sample
# as another list, so everything ends up in memory at once.
for i in random.sample(xrange(COUNT), COUNT):
    file_1.write('ID{0},A{0}\n'.format(i))
file_1.close()

But it's eating all of my memory.

Is there a way to generate a big shuffled sequence of consecutive integers (they don't strictly have to be consecutive, as long as they are unique), using a generator instead of keeping the whole sequence in RAM?

asked Apr 27 '13 by warvariuc




2 Answers

If you have 100 million numbers, as in the question, this is actually manageable in memory (it takes about 0.5 GB).

As DSM pointed out, this can be done efficiently with the standard modules:

>>> import array
>>> a = array.array('I', xrange(10**8))  # a.itemsize indicates 4 bytes per element => about 0.5 GB
>>> import random                                                               
>>> random.shuffle(a)
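
If you then want to write the shuffled ids to a file in the question's "IDn,An" format, a minimal sketch (my addition, not part of the original answer; it reuses the shuffled array a from above and the file name file1 from the question) could look like:

# Sketch only: stream the shuffled ids to disk, without building any extra list.
with open('file1', 'w') as out:   # 'file1' is the name used in the question
    for i in a:                   # 'a' is the shuffled array.array from above
        out.write('ID{0},A{0}\n'.format(i))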

It is also possible to use the third-party NumPy package, which is the standard Python tool for managing arrays in an efficient way:

>>> import numpy
>>> ids = numpy.arange(100000000, dtype='uint32')  # 32 bits is enough for numbers up to about 4 billion
>>> numpy.random.shuffle(ids)

(This is mostly useful if your program already uses NumPy; the standard-module approach above is about as efficient.)


Both methods take about the same amount of time on my machine (maybe 1 minute for the shuffling), and the 0.5 GB they use is not too big for current computers.

PS: With this many elements the shuffle cannot be truly uniform: the number of possible permutations is far larger than the period of the random generator used. In other words, Python can produce fewer distinct shuffles than there are possible orderings!
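
As a rough back-of-the-envelope check of that remark (my addition, not part of the original answer), you can compare the number of bits needed to choose a uniformly random permutation of 10**8 elements with the roughly 19937 bits of state in Python's Mersenne Twister generator:

import math

n = 10**8
perm_bits = math.lgamma(n + 1) / math.log(2)  # log2(n!): entropy of a uniform permutation
mt_state_bits = 19937                         # Mersenne Twister period is 2**19937 - 1

print('bits for a uniform permutation: %.3g' % perm_bits)  # roughly 2.5e9 bits
print('bits of generator state:        %d' % mt_state_bits)

The generator can therefore reach only a vanishingly small fraction of all possible orderings, which is exactly what the PS above points out.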

answered by Eric O Lebigot


Maybe something like this (the numbers won't be consecutive, but they will be unique):

from uuid import uuid4

def unique_nums():  # Not strictly unique, but *practically* unique
    while True:
        yield int(uuid4().hex, 16)
        # alternative: yield uuid4().int

unique_num = unique_nums()
next(unique_num)
next(unique_num) # etc...
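
To tie this back to the question's file-writing loop, here is a minimal sketch (my addition; COUNT and the file name file1 are taken from the question's code, and it reuses the unique_nums() generator defined above):

from itertools import islice

COUNT = 100000000
with open('file1', 'w') as out:
    # Take COUNT practically-unique ids from the generator and stream them to disk.
    for n in islice(unique_nums(), COUNT):
        out.write('ID{0},A{0}\n'.format(n))

Note that these ids are huge 128-bit integers rather than small consecutive numbers, which is the trade-off this approach makes.
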
answered by Jon Clements