Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Finding smallest substring not present in string

I have a string consisting only of digits 0-9. The string can be between 1 and 1,000,000 characters in length. I need to find the smallest number that isn't present in the string, in linear time. Here are some examples:

1023456789       //Smallest number not in string is 11
1023479          //Smallest number not in string is 5
112131405678910  //Smallest number not in string is 15

With size of 1,000,000, I figured that the smallest number not present in the string has to be at most 6 digits.

My approach was to generate all numbers 0 through 999,999 and insert them all in a vector (in order). Then make a map that marks what strings have already been seen. Then I iterate through the string, and for each position I get all substring starting from it, size 1 to 6, and I mark all those substrings as true in the map. At the end, I just check all keys one by one, and when I find the first one that has a false value in the map, I print it.

Here are some code snippets:

string tmp="0";
string numbers[999999];

void increase(int pos)
{
    if(pos==-1)tmp.insert(0,"1");
    else if(tmp.at(pos)!='9')tmp.at(pos)++;
    else
    {
        tmp.at(pos)='0';
        increase(pos-1);
    }
}

//And later inside main
for(int j=0;j<999999;j++)
{
    numbers[j]=tmp;
    increase(tmp.size()-1);
}

for(int j=0;j<input.size();j++)
    {
        for(int k=0;k<6;k++)
        {
            string temp="";
            if(j+k<input.size())
            {
                temp+=input.at(j+k);
                appeared[temp]=true;
            }
        }
    }

int counter=0;
while(appeared[numbers[counter]])counter++;
cout<<numbers[counter]<<endl;

A note about the first part of the algorithm. I generate the vector once, then I use it for 100 strings. I need to parse all 100 strings in less than 4 seconds.

This algorithm is too slow for me as it is currently. Could I optimize some of the code, or should I consider a different approach?

like image 466
A. Andevski Avatar asked Jul 02 '14 19:07

A. Andevski


3 Answers

Idea is to build a tree of numbers that were met:

class Node {
public:
    Node() : count( 0 ) {}
    // create a tree from substring [from, to[ interval
    void build( const std::string &str, size_t from, size_t to )
    {
        Node *node = this;
        while( from != to )
            node = node->insert( str[from++] );
    }

    std::string smallestNumber(  bool root = true, int limit = 0 ) const;

 private:
    Node *insert( char c ) 
    {
        int idx = c - '0';
        if( !children[idx] ) {
            ++count;
            children[idx].reset( new Node );
        }
        return children[idx].get();
    }

    int count;
    std::unique_ptr<Node> children[10];

};

std::string Node::smallestNumber( bool root, int limit ) const
{
    std::string rez;
    if( count < 10 ) { // for this node string is one symbol length
        for( int i = 0; i < 10; ++i )
            if( !children[i] ) return std::string( 1, '0' + i );
        throw std::sruntime_error( "should not happen!" );
    }
    if( limit ) { 
        if( --limit == 1 ) return rez; // we cannot make string length 1
    }
    char digit = '0';
    for( int i = 0; i < 10; ++i ) {
        if( root && i == 0 ) continue;
        std::string tmp = children[i]->smallestNumber( false, limit );
        if( !tmp.empty() ) {
            rez = tmp;
            digit = '0' + i;
            limit = rez.length();
            if( limit == 1 ) break;
        }
    }
    return digit + rez;
}

void calculate( const std::string &str )
{
    Node root;
    for( size_t i = 0; i < str.length(); ++i ) {
        root.build( str, i, i + std::min( 6UL, str.length() - i ) );
    }
    std::cout << "smallest number is:" << root.smallestNumber() << std::endl;
}

int main()
{
    calculate( "1023456789" );
    calculate( "1023479" );
    calculate( "112131405678910" );
    return 0;
}

EDIT: after some thought I realized that inner loop is completely unnecessary. 1 loop is enough. String length is limited to 6, I rely on OPs estimation of biggest number possible.

Output:

smallest number is:11
smallest number is:5
smallest number is:15
like image 120
Slava Avatar answered Oct 10 '22 15:10

Slava


Here's how I would approach the problem. The idea is to generate sets of unique substrings of particular length, starting from the shortest and then testing those before generating longer substrings. This allows the code to not make assumptions about the upper bound of the result and also should be much faster for long input strings that have small results. Still, it's not necessarily better in the worst case of big results.

int find_shortest_subnumber(std::string str) {
    static int starts[10] = {
        0, 10, 100, 1000, 10000, 
        100000, 1000000, 10000000, 100000000, 1000000000
    };
    // can't find substrings longer than 9 (won't fit in int)
    int limit = std::min((int)str.size(), 9);
    for(int length = 1; length <= limit; length++) {
        std::set<std::string> uniques; // unique substrings of current length
        for(int i = 0; i <= (int)str.size() - length; i++) {
            auto start = str.begin() + i;
            uniques.emplace(start, start + length);
        }
        for(int i = starts[length - 1]; i < starts[length]; i++) {
            if(uniques.find(std::to_string(i)) == uniques.end())
                return i;
        }
    }
    return -1; // not found (empty string or too big result)
}

I haven't done proper complexity analysis. I crudely tested the function with a particular test string that was 1 028 880 characters long and had the result of 190 000. It took about 2s to execute on my machine (which includes generation of the test string which should be negligible).

like image 44
eerorika Avatar answered Oct 10 '22 15:10

eerorika


You can construct a suffix tree for the string in linear time (and space). Once you have the suffix tree, you simply need to breadth-first walk it scanning the children of each node in lexicographical order, and checking all 10 digits at each node. The first missing one is the last digit in the smallest missing number.

Since a 1,000,000 digit sequence only has 999,995 six-digit subsequences, there must be at least five six-digit subsequences not present, so the breadth-first search must terminate no later than the sixth level; consequently, it is also linear time.

like image 30
rici Avatar answered Oct 10 '22 16:10

rici