
Random sampling of non-overlapping substrings of length k

Given a string of length n, how would I (pseudo)randomly sample m substrings of size k such that none of the sampled substrings overlap? Most of my scripting experience is in Perl, but an easy-to-run solution in any common language will suffice.

Daniel Standage asked Oct 04 '22 02:10



2 Answers

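If there is a character that cannot occur in the input (e.g. X), pick random start positions and overwrite each accepted substring with that character, rejecting any candidate that already contains it: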

my $size = 20;
my $count = 20;
my $mark = 'X';
my $input = 'CCACGCATTTTTGTTCATTGTTCTGGCTTCTTACAAGGTTCAGTAGACTTTGTAACACAGTTGTGTCTCTCACAGATTGGCAGATGTTTGGTAAAGGATTGACTTTTCAGCCAACTCATGGGAAAGTGAAATAATGTAAAAAACAGGAAGAATACAGTTTTAGGCCTTTCAAGTGAGGCATGGCTTTCAGCTCTTGGCAAGAACAGGCAAGGAGATGCAAGTTTTAGGACTCTAAGAGGCTAGGCTTTTCAAAGTGCTTCTCTCCCCTTCACCCTCCTTCAGTTACAGCACCAAGCACCACCGAGGTGTTACCTGCAGCCTCACTCTCTACCTGGTTGTGGGATCCTGCCACTTCCTTAACCCACACTGAGTTCCTTGTGGTTCACAGGGTCACACAGAGGGCTGTAGAGATACAAAAGATATATGTGATTTTATATCACCTATCATATGAAGATATATTTATAAAATAGGAAACATATTAACCACTTATCATTTTATATATTTATGGTTTTATGTGTCAAAAATATATTGTTTCATGTATGTATTAAAGGATAAGTATGTATAAGAGGTTTTATAGATGTGTAAAATTATATATTTATACGTATCTTTACAAATTTAAGAATAAAGGAAGGAAAATTCTCAAAGAGGAATTCAGATATCAAGCAGTGCCCTTTGACCAAGAGCCTTGGTTACAACATACCTACAAAAGTGAACTATCATTGAAAGACCTATGGACACTGGATTTCTCTTTCCTTATTTAGAAGGGCAGTCTGTGTCTTGGAAAAGCATACAGTTTGTTGTATCTTGCTGGACAACAGGAGTCA';

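# Worst case: earlier picks leave only gaps of $size-1 characters, so an input
# no longer than ($count-1)*$size + $count*($size-1) may never yield enough picks.
if (2*$size*$count-$size-$count >= length($input)) {
    die "selection may not complete; choose a shorter length or fewer substrings, or provide a longer input string\n";
}

my @substrings;
while (@substrings < $count) {
    my $pos = int rand(length($input)-$size+1);
    # 4-argument substr returns the old text and overwrites it with the marker;
    # skip any candidate that already touches a marked (previously taken) region.
    push @substrings, substr($input, $pos, $size, $mark x $size)
        if substr($input, $pos, $size) !~ /\Q$mark/;
}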
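
For inputs where no spare marker character is available, the same rejection idea can be sketched in Python with a boolean mask in place of the overwritten string. This is only an illustration under stated assumptions: the function name `sample_by_rejection`, the `rng` parameter, and the choice to return start positions alongside the substrings are not part of the Perl code above.

```python
import random

def sample_by_rejection(text, k, m, rng=random):
    """Rejection-sample m non-overlapping substrings of length k from text.

    Returns a list of (start, substring) pairs in the order they were drawn.
    Hypothetical helper: the name and signature are illustrative only.
    """
    n = len(text)
    # Same guard as the Perl code: at or below this bound, earlier picks can
    # leave only gaps of k-1 characters, so the loop might never finish.
    if 2 * k * m - k - m >= n:
        raise ValueError("text too short to guarantee m non-overlapping samples")
    used = [False] * n  # used[i] is True once position i belongs to a sample
    picks = []
    while len(picks) < m:
        start = rng.randrange(n - k + 1)
        if any(used[start:start + k]):
            continue  # overlaps an earlier pick, so draw again
        used[start:start + k] = [True] * k
        picks.append((start, text[start:start + k]))
    return picks
```

Calling `sample_by_rejection(seq, 20, 20)` on a sequence like the one above would mirror the Perl settings; passing `random.Random(seed)` as `rng` makes a run repeatable.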
ysth answered Oct 13 '22 12:10



This is a recursive approach in Python. At each step, randomly select one of the remaining partitions of the string, then randomly select a substring of length k from the chosen partition. Replace that partition with the two pieces left on either side of the chosen substring, filter out partitions shorter than k, and repeat. The function returns the list of substrings once it holds m of them, or once no partition of length k or more remains.

import random

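def f(l, k, m, result=None):
    # Use None rather than a mutable default so results do not leak between calls.
    if result is None:
        result = []
    if isinstance(l, str):
        l = [l]
    # Keep only partitions that can still hold a substring of length k.
    parts = [part for part in l if len(part) >= k]
    if len(result) == m or not parts:
        return result
    # Choose a partition, then a start position inside it.
    part_num = random.randint(0, len(parts)-1)
    partition = parts.pop(part_num)
    start = random.randint(0, len(partition)-k)
    result.append(partition[start:start+k])
    # Replace the partition with the pieces on either side of the sample.
    parts.extend([partition[:start], partition[start+k:]])
    return f(parts, k, m, result)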
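
If the positions of the samples in the original string are needed, the same partition-splitting idea can be written iteratively over index ranges. The following is a minimal sketch, assuming a hypothetical function name `sample_partitions` and an optional `rng` argument; it tracks half-open intervals rather than string pieces so that each start position is preserved.

```python
import random

def sample_partitions(text, k, m, rng=random):
    """Sample up to m non-overlapping substrings of length k by splitting segments.

    Returns (start, substring) pairs; fewer than m are returned if no free
    segment of length >= k remains. Hypothetical helper for illustration.
    """
    # Each free segment is a half-open interval [lo, hi) of positions in text.
    segments = [(0, len(text))] if len(text) >= k else []
    picks = []
    while len(picks) < m and segments:
        # Pick a free segment uniformly, then a start position inside it.
        lo, hi = segments.pop(rng.randrange(len(segments)))
        start = rng.randint(lo, hi - k)
        picks.append((start, text[start:start + k]))
        # Keep the left and right remainders only if they can still hold a sample.
        for seg in ((lo, start), (start + k, hi)):
            if seg[1] - seg[0] >= k:
                segments.append(seg)
    return picks
```

As in the recursive version, the segment is chosen uniformly regardless of its length, so a short segment is as likely to be picked as a long one.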
Matthew Plourde answered Oct 13 '22 11:10
