How to determine the number of possible combinations of letters that contain a degenerate substring

Question

I've been racking my brain for a couple of days to work out a series or closed-form equation to the following problem:

Specifically: given all strings of length N that draws from an alphabet of L letters (starting with 'A', for example {A, B}, {A, B, C}, ...), how many of those strings contain a substring that matches the pattern: 'A', more than 1 not-'A', 'A'. The standard regular expression for that pattern would be A[^A][^A]+A.

The number of possible strings is simple enough: L^N . For small values of N and L, it's also very practical to simply create all possible combinations and use a regular expression to find the substrings that match the pattern; in R:

all.combinations <- function(N, L) {
    apply(
        expand.grid(rep(list(LETTERS[1:L]), N)),
        1,
        paste,
        collapse = ''
    )
}

matching.pattern <- function(N, L, pattern = 'A[^A][^A]+A') {
    sum(grepl(pattern, all.combinations(N, L)))
}

all.combinations(4, 2)
matching.pattern(4, 2)

I had come up with the following, which works for N < 7:

M <- function(N, L) {
    sum(
        sapply(
            2:(N-2),
            function(g) {
                (N - g - 1) * (L - 1) ** g * L ** (N - g - 2)
            }
        )
    )
}

Unfortunately, that only works while N < 7 because it's simply adding the combinations that have substrings A..A, A...A, A....A, etc. and some combinations obviously have multiple matching substrings (e.g., A..A..A, A..A...A), which are counted twice.

Any suggestions? I am open to procedural solutions too, so long as they don't blow up with the number of combinations (like my code above would). I'd like to be able to compute for values of N from 15 to 25 and L from 2 to 10.

For what it is worth, here's the number of combinations, and matching combinations for some values of N and L that are tractable to determine by generating all combinations and doing a regular expression match:

 N  L  combinations  matching
--  -  ------------  --------
 4  2            16         1
 5  2            32         5
 6  2            64        17
 7  2           128        48
 8  2           256       122
 9  2           512       290
10  2          1024       659
 4  3            81         4
 5  3           243        32
 6  3           729       172
 7  3          2187       760
 8  3          6561      2996
 9  3         19683     10960
10  3         59049     38076
 4  4           256         9
 5  4          1024        99
 6  4          4096       729
 7  4         16384      4410
 8  4         65536     23778
 9  4        262144    118854
10  4       1048576    563499

Ante · Accepted Answer

It is possible to use dynamic programming approach.

For fixed L, let X(n) be number of strings of length n that contain given pattern, and let A(n) be number of strings of length n that contain given pattern and starts with A.

First derive recursion formula for A(n). Lets count all strings in A(n) by grouping them by first 2-3 letters. Number of strings in A(n) with:

"second letter A" is A(n-1),
"second letter non-A and third letter is A" is A(n-2),
"second and third letter non-A" is (L^(n-3) - (L-1)^(n-3)). That is because string 'needs' at least one A in remaining letters to be counted.

With that:

A(n) = A(n-1) + (L-1) * (A(n-2) + (L-1) * (L^(n-3) - (L-1)^(n-3)))

String of length n+1 can start with A or non-A:

X(n+1) = A(n+1) + (L-1) * X(n)
X(i) = A(i) = 0, for i <= 3

Python implementation:

def combs(l, n):
    x = [0] * (n + 1)  # First element is not used, easier indexing
    a = [0] * (n + 1)
    for i in range(4, n+1):
        a[i] = a[i-1] + (l-1) * (a[i-2] + (l-1) * (l**(i-3) - (l-1)**(i-3)))
        x[i] = a[i] + (l-1) * x[i-1]
    return x[4:]

print(combs(2, 10))
print(combs(3, 10))
print(combs(4, 10))

How to determine the number of possible combinations of letters that contain a degenerate substring

Tags:

regex

r

combinatorics

probability

longest-substring

James McIninch

1 Answers

Ante

Recent Activity

Donate For Us

How to determine the number of possible combinations of letters that contain a degenerate substring

Tags:

regex

r

combinatorics

probability

longest-substring

James McIninch

1 Answers

Ante

Related questions

Recent Activity

Donate For Us