Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Algorithm for exclusion of numbers

Tags:

You are given a integer N which fits in long(less than 2^63-1) and 50 other integers. Your task is to find how many numbers from 1 to N contain none of the 50 numbers as its substring?

This question is from an interview.

like image 887
user1499215 Avatar asked Jul 03 '12 15:07

user1499215


2 Answers

You didn't specify a base, but I'll assume decimal without loss of generality.

First, recognize that this is a string problem, not a numeric one.

Build a finite automaton A to recognize the 50 integers as substrings of other strings. E.g., the two integers 44 and 3 are recognized as substrings by the RE

^.*(44|3).*$

Build a finite automaton B to recognize all numbers less than N. E.g., to recognize 1 through 27 (inclusive) in decimal, that can be achieved by compiling the RE

^([1-9]|1[0-9]|2[0-7])$

Compute the intersection C of the automata A and B, which is in turn an FA.

Use a dynamic programming algorithm to compute the size of the language recognized by C. Subtract that from the size of the language recognized by A, computed by the same algorithm.

(I am not claiming that this is asymptotically optimal, but it was a fast enough to solve lots of Project Euler problems :)

like image 78
Fred Foo Avatar answered Oct 07 '22 21:10

Fred Foo


This is only an explanation of what larsmans already wrote. If you like this answer, please vote him up in addition.

A finite automaton, FA, is just a set of states, and rules saying that if you are in state S and the next character you are fed is c then you transition to state T. Two of the states are special. One means, "start here" and the other means "I successfully matched". One of the characters is special, and means "the string just ended". So you take a string and a finite automaton, start in the starting state, keep feeding characters into the machine and changing states. You fail to match if you give any state unexpected input. You succeed in matching if you ever reach the state "I successfully matched".

Now there is a well-known algorithm for converting a regular expression into a finite automaton that matches a string if and only if that regular expression matches. (If you've read about regular expressions, this is how DFA engines work.) To illustrate I'll use the pattern ^.*(44|3).*$ which means "start of the string, any number of characters, followed by either 44 or 3, followed by any number of characters, followed by the end of the string."

First let's label all of the positions we can be in in the regular expression when we're looking for the next character: ^A.*(4B4|3)C.*$

The states of our regular expression engine will be subsets of those positions, and the special state matched. The result of a state transition will be the set of states we could get to if we were at that position, and saw a particular character. Our starting position is at the start of the RE, which is {A}. Here are the states that can be reached:

S1: {A}   # start
S2: {A, B}
S3: {A, C}
S4: {A, B, C}
S5: matched

Here are the transition rules:

S1:
  3: S3
  4: S2
  end of string: FAIL
  any other char: S1
S2:
  3: S3
  4: S3
  end of string: FAIL
  any other char: S1
S3:
  4: S4
  end of string: S5 (match)
  any other char: S3
S4:
  end of string: S5 (match)
  any other char: S4

Now if you take any string, start that in state S1, and follow the rules, you'll match that pattern. The process can be long and tedious, but luckily can be automated. My guess is that larsmans has automated it for his own use. (Technical note, the expansion from "positions in the RE" to "sets of positions you might possibly be in" can be done either up front, as here, or at run time. For most REs it is better to do it up front, as here. But a tiny fraction of pathological examples will wind up with a very large number of states, and it can be better to do those at run-time.)

We can do this with any regular expression. For instance ^([1-9]|1[0-9]|2[0-7])$ can get labeled: ^A([1-9]|1B[0-9]|2C[0-7])D$ and we get the states:

T1: {A}
T2: {D}
T3: {B, D}
T4: {C, D}

and transitions:

T1:
  1: T3
  2: T4
  3-9: T2
  any other char: FAIL
T2:
  end of string: MATCH
  any other char: FAIL
T3:
  0-9: T2
  end of string: MATCH
  any other char: FAIL
T4:
  0-7: T2
  end of string: MATCH
  any other char: FAIL

OK, so we know what a regular expression is, what a finite automaton, and how they relate. What is the intersection of two finite automata? It is just a finite automaton that matches when both finite automata individually match, and otherwise fails to match. It is easy to construct, its set of states is just the set of pairs of a state in the one, and a state in the other. Its transition rule is to just apply the transition rule for each member independently, if either fails the whole does, if both match they both do.

For the above pair, let's actually execute the intersection on the number 13. We start in state (S1, T1)

state: (S1, T1)  next char: 1
state: (S1, T3)  next char: 3
state: (S3, T2)  next char: end of string
state: (matched, matched) -> matched

And then on the number 14.

state: (S1, T1)  next char: 1
state: (S1, T3)  next char: 4
state: (S2, T2)  next char: end of string
state: (FAIL, matched) -> FAIL

Now we come to the whole point of this. Given that final finite automata, we can use dynamic programming to figure out how many strings there are that match it. Here is that calculation:

0 chars:
  (S1, T1): 1
    -> (S1, T3): 1 # 1
    -> (S1, T4): 1 # 2
    -> (S3, T2): 1 # 3
    -> (S2, T2): 1 # 4
    -> (S1, T2): 5 # 5-9
1 chars:
  (S1: T2): 5      # dead end
  (S1, T3): 1
    -> (S1, T2): 8 # 0-2, 5-9
    -> (S2, T2): 1 # 3
    -> (S3, T2): 1 # 4
  (S1, T4): 1
    -> (S1, T2): 6 # 0-2, 5-7
    -> (S2, T2): 1 # 3
    -> (S3, T2): 1 # 4
  (S2, T2): 1      # dead end
  (S3, T2): 1
    -> match:    1 # end of string
2 chars:
  (S1, T2): 14     # dead end
  (S2, T2): 2      # dead end
  (S3, T2): 2
    -> match     2 # end of string
  match:    1
    -> match     1 # carry through the count
3 chars:
  match:    3

OK, that's a lot of work, but we found that there are 3 strings that match both of those rules simultaneously. And we did it in a way that is automatable and scaleable to much larger numbers.

Of course the question we were originally posed was how many matched the second but not the first. Well we know that 27 match the second rule, 3 match both, so 24 must match the second rule but not the first.

As I said before, this is just larsmans solution elucidated. If you learned something, upvote him, vote for his answer. If this material sounds interesting, go buy a book like Progamming Language Pragmatics and learn a lot more about finite automata, parsing, compilation, and the like. It is a very good skillset to have, and far too many programmers don't.

like image 45
btilly Avatar answered Oct 07 '22 19:10

btilly