How to create N
"random" strings of length K
using the probability table? K
would be some even number.
prob_table = {'aa': 0.2, 'ab': 0.3, 'ac': 0.5}
Let's say K = 6
, there would be a higher probability of 'acacab'
than 'aaaaaa'
.
This is sub-problem of a larger problem that I’m using to generate synthetic sequences based on a probability table. I’m not sure how to use the probability table to generate “random” strings?
What I have so far:
def seq_prob(fprob_table,K= 6, N= 10):
#fprob_table is the probability dictionary that you input
#K is the length of the sequence
#N is the amount of sequences
seq_list = []
#possibly using itertools or random to generate the semi-"random" strings based on the probabilities
return seq_list
There are some good approaches to making weighted random choices described at the end of the documentation for the builtin random
module:
A common task is to make a random.choice() with weighted probabilities.
If the weights are small integer ratios, a simple technique is to build a sample population with repeats:
>>> weighted_choices = [('Red', 3), ('Blue', 2), ('Yellow', 1), ('Green', 4)]
>>> population = [val for val, cnt in weighted_choices for i in range(cnt)]
>>> random.choice(population)
'Green'
A more general approach is to arrange the weights in a cumulative distribution with itertools.accumulate(), and then locate the random value with bisect.bisect():
>>> choices, weights = zip(*weighted_choices)
>>> cumdist = list(itertools.accumulate(weights))
>>> x = random.random() * cumdist[-1]
>>> choices[bisect.bisect(cumdist, x)]
'Blue'
To adapt that latter approach to your specific problem, I'd do:
import random
import itertools
import bisect
def seq_prob(fprob_table, K=6, N=10):
choices, weights = fprob_table.items()
cumdist = list(itertools.accumulate(weights))
results = []
for _ in range(N):
s = ""
while len(s) < K:
x = random.random() * cumdist[-1]
s += choices[bisect.bisect(cumdist, x)]
results.append(s)
return results
This assumes that the key strings in your probability table are all the same length If they have multiple different lengths, this code will sometimes (perhaps most of the time!) give answers that are longer than K
characters. I suppose it also assumes that K
is an exact multiple of the key length, though it will actually work if that's not true (it just will give result strings that are all longer than K
characters, since there's no way to get K
exactly).
You could use random.random
:
from random import random
def seq_prob(fprob_table, K=6, N=10):
#fprob_table is the probability dictionary that you input
#K is the length of the sequence
#N is the amount of sequences
seq_list = []
s = ""
while len(seq_list) < N:
for k, v in fprob_table.items():
if len(s) == K:
seq_list.append(s)
s = ""
break
rn = random()
if rn <= v:
s += k
return seq_list
This can be no doubt be improved upon but the random.random
is useful when dealing with probability.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With