Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Generate N "random" string of length K using probability table

How to create N "random" strings of length K using the probability table? K would be some even number.

prob_table = {'aa': 0.2, 'ab': 0.3, 'ac': 0.5}

Let's say K = 6, there would be a higher probability of 'acacab' than 'aaaaaa'.

This is sub-problem of a larger problem that I’m using to generate synthetic sequences based on a probability table. I’m not sure how to use the probability table to generate “random” strings?

What I have so far:

def seq_prob(fprob_table,K= 6, N= 10):
    #fprob_table is the probability dictionary that you input
    #K is the length of the sequence
    #N is the amount of sequences
    seq_list = []
    #possibly using itertools or random to generate the semi-"random" strings based on the probabilities 
    return seq_list
like image 234
nfs Avatar asked Mar 17 '23 20:03

nfs


2 Answers

There are some good approaches to making weighted random choices described at the end of the documentation for the builtin random module:

A common task is to make a random.choice() with weighted probabilities.

If the weights are small integer ratios, a simple technique is to build a sample population with repeats:

>>> weighted_choices = [('Red', 3), ('Blue', 2), ('Yellow', 1), ('Green', 4)]
>>> population = [val for val, cnt in weighted_choices for i in range(cnt)]
>>> random.choice(population)
'Green'

A more general approach is to arrange the weights in a cumulative distribution with itertools.accumulate(), and then locate the random value with bisect.bisect():

>>> choices, weights = zip(*weighted_choices)
>>> cumdist = list(itertools.accumulate(weights))
>>> x = random.random() * cumdist[-1]
>>> choices[bisect.bisect(cumdist, x)]
'Blue'

To adapt that latter approach to your specific problem, I'd do:

import random
import itertools
import bisect

def seq_prob(fprob_table, K=6, N=10):
    choices, weights = fprob_table.items()
    cumdist = list(itertools.accumulate(weights))

    results = []
    for _ in range(N):
        s = ""
        while len(s) < K:
            x = random.random() * cumdist[-1]
            s += choices[bisect.bisect(cumdist, x)]
        results.append(s)

    return results

This assumes that the key strings in your probability table are all the same length If they have multiple different lengths, this code will sometimes (perhaps most of the time!) give answers that are longer than K characters. I suppose it also assumes that K is an exact multiple of the key length, though it will actually work if that's not true (it just will give result strings that are all longer than K characters, since there's no way to get K exactly).

like image 116
Blckknght Avatar answered Mar 25 '23 05:03

Blckknght


You could use random.random:

from random import random
def seq_prob(fprob_table, K=6, N=10):
    #fprob_table is the probability dictionary that you input
    #K is the length of the sequence
    #N is the amount of sequences
    seq_list = []
    s = ""
    while len(seq_list) < N:
        for k, v in fprob_table.items():
            if len(s) == K:
                seq_list.append(s)
                s = ""
                break
            rn = random()
            if rn <=  v:
                s += k
    return seq_list

This can be no doubt be improved upon but the random.random is useful when dealing with probability.

like image 32
Padraic Cunningham Avatar answered Mar 25 '23 07:03

Padraic Cunningham