I've been racking my brain for a couple of days to work out a series or closed-form equation to the following problem:
Specifically: given all strings of length N that draws from an alphabet of L letters (starting with 'A', for example {A, B}, {A, B, C}, ...), how many of those strings contain a substring that matches the pattern: 'A', more than 1 not-'A', 'A'. The standard regular expression for that pattern would be A[^A][^A]+A
.
The number of possible strings is simple enough: L^N . For small values of N and L, it's also very practical to simply create all possible combinations and use a regular expression to find the substrings that match the pattern; in R:
all.combinations <- function(N, L) {
apply(
expand.grid(rep(list(LETTERS[1:L]), N)),
1,
paste,
collapse = ''
)
}
matching.pattern <- function(N, L, pattern = 'A[^A][^A]+A') {
sum(grepl(pattern, all.combinations(N, L)))
}
all.combinations(4, 2)
matching.pattern(4, 2)
I had come up with the following, which works for N < 7:
M <- function(N, L) {
sum(
sapply(
2:(N-2),
function(g) {
(N - g - 1) * (L - 1) ** g * L ** (N - g - 2)
}
)
)
}
Unfortunately, that only works while N < 7 because it's simply adding the combinations that have substrings A..A, A...A, A....A, etc. and some combinations obviously have multiple matching substrings (e.g., A..A..A, A..A...A), which are counted twice.
Any suggestions? I am open to procedural solutions too, so long as they don't blow up with the number of combinations (like my code above would). I'd like to be able to compute for values of N from 15 to 25 and L from 2 to 10.
For what it is worth, here's the number of combinations, and matching combinations for some values of N and L that are tractable to determine by generating all combinations and doing a regular expression match:
N L combinations matching
-- - ------------ --------
4 2 16 1
5 2 32 5
6 2 64 17
7 2 128 48
8 2 256 122
9 2 512 290
10 2 1024 659
4 3 81 4
5 3 243 32
6 3 729 172
7 3 2187 760
8 3 6561 2996
9 3 19683 10960
10 3 59049 38076
4 4 256 9
5 4 1024 99
6 4 4096 729
7 4 16384 4410
8 4 65536 23778
9 4 262144 118854
10 4 1048576 563499
It is possible to use dynamic programming approach.
For fixed L
, let X(n)
be number of strings of length n
that contain given pattern, and let A(n)
be number of strings of length n
that contain given pattern and starts with A.
First derive recursion formula for A(n)
. Lets count all strings in A(n)
by grouping them by first 2-3 letters. Number of strings in A(n)
with:
A(n-1)
,A(n-2)
,(L^(n-3) - (L-1)^(n-3))
. That is because string 'needs' at least one A in remaining letters to be counted.With that:
A(n) = A(n-1) + (L-1) * (A(n-2) + (L-1) * (L^(n-3) - (L-1)^(n-3)))
String of length n+1
can start with A or non-A:
X(n+1) = A(n+1) + (L-1) * X(n)
X(i) = A(i) = 0, for i <= 3
Python implementation:
def combs(l, n):
x = [0] * (n + 1) # First element is not used, easier indexing
a = [0] * (n + 1)
for i in range(4, n+1):
a[i] = a[i-1] + (l-1) * (a[i-2] + (l-1) * (l**(i-3) - (l-1)**(i-3)))
x[i] = a[i] + (l-1) * x[i-1]
return x[4:]
print(combs(2, 10))
print(combs(3, 10))
print(combs(4, 10))
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With