Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Algorithm for finding all of the shared substrings of any length between 2 strings, and then counting occurrences in string 2?

I've run into an unusual challenge and so far I'm unable to determine the most efficient algorithm to attack this.


Given the following 2 strings as an example, find all commonly shared substrings between the 2 strings of any length, and count the number of occurrences of all of those shared substrings in string 2. Your algorithm also needs to be able to compute shared substrings between files containing strings that are up to 100MB or more in size.

Example:

String 1: ABCDE512ABC361EG51D

String 2: ADE5AHDW4131EG1DG5C

Given these 2 strings this algorithm would find the following shared substrings: A,C,D,E,5,1,3,G,DE,E5,EG,G5,1D,DE5,1EG

And then from these commonly shared substrings, we'd find how many occurences there are of each of them in string 2.

A: 2 occurrences in string 2

C: 1 occurence in string 2

D: 3 occurrences in string 2

etc..


The first approach I took to wrap my head around this problem was brute forcing my way through computing the common shared substrings using 2 nested for loops - obviously the least efficient but it was a quick and dirty way to get an idea of what the expected outputs should be with smaller test input and the slowest possible time to run, which was around 2 minutes to compute all common shared substrings between 2 files containing ascii strings with a size of 50kb. Upping the size to 1mb made this come to a screeching halt due to the massive number of total nested iterations that had to occur to compute this.

The next approach was using trees - seeing how much memory I could trade off to optimize compute time. This approach was much faster. The same two 50kb files that took 2 minute with the brute force method were near instant. Running against 1mb files was very fast too still (seconds) but as I continued to test with larger and larger file sizes, I quickly began running into memory issues due to tree sizes.


Note: The string files will only ever contain ASCII characters!


Edit:

I'm escalating this a bit further, please see:

https://gist.github.com/braydo25/f7a9ce7ce7ad7c5fb11ec511887789bc

like image 960
Braydon Batungbacal Avatar asked Nov 04 '16 22:11

Braydon Batungbacal


People also ask

How do you find all substrings in length K?

If the length of a string is N, then there can be N – K + 1 substring of length K. Generating these substrings will require O(N) complexity, and checking each substring requires O(K) complexity, hence making the overall complexity like O(N*K). Efficient approach: The idea is to use Window Sliding Technique.

How many substrings are in a string of length n?

Approach: The count of sub-strings of length n will always be len – n + 1 where len is the length of the given string.

How do I split a string into number of substrings?

Use the Split method when the substrings you want are separated by a known delimiting character (or characters). Regular expressions are useful when the string conforms to a fixed pattern. Use the IndexOf and Substring methods in conjunction when you don't want to extract all of the substrings in a string.


3 Answers

Here is some code illustrating the idea I presented in the comments above. Although it is runnable C++ code, it is more pseudo-code in the sense that the utilized data structures are surely not optimal but they allow a clear view on the algorithm.

struct Occurrence
{
    //The vectors contain indices to the first character of the occurrence in ...
    std::vector<size_t> s1;  // ... string 1 and ...
    std::vector<size_t> s2;  // ... string 2.
};

int main()
{
    //If you cannot load the entire strings in memory, a memory-mapped file might be
    //worth considering
    std::string s1 = "ABCDE512ABC361EG51D";
    std::string s2 = "ADE5AHDW4131EG1DG5C";

    //These vectors store the occurrences of substrings for the current and next length
    std::vector<Occurrence> occurrences, nextOccurrences;
    int length = 1;

    std::map<char, Occurrence> occurrenceMap;
    //Initialize occurrences
    for (int i = 0; i < s1.length(); ++i)
        occurrenceMap[s1[i]].s1.push_back(i);
    for (int i = 0; i < s2.length(); ++i)
        occurrenceMap[s2[i]].s2.push_back(i);

    for (auto& pair : occurrenceMap)
    {
        if (pair.second.s1.size() > 0 && pair.second.s2.size() > 0)
            occurrences.push_back(std::move(pair.second));
    }

    do
    {
        nextOccurrences.clear();

        std::cout << "Length " << length << std::endl;
        for(auto& o : occurrences)
        {
            std::cout << std::string(s1.c_str() + o.s1[0], length) << " occurred "
                      << o.s1.size() << " / " << o.s2.size() << " times." << std::endl;

            //Expand the occurrence
            occurrenceMap.clear();
            for (auto p : o.s1)
            {
                if (p + length < s1.length())
                    occurrenceMap[s1[p + length]].s1.push_back(p);
            }                   
            for (auto p : o.s2)
            {
                if (p + length < s2.length())
                occurrenceMap[s2[p + length]].s2.push_back(p);
            }
            for (auto& pair : occurrenceMap)
            {
                if (pair.second.s1.size() > 0 && pair.second.s2.size() > 0)
                    nextOccurrences.push_back(std::move(pair.second));
            }
        }

        ++length;
        std::swap(occurrences, nextOccurrences);

    } while (!occurrences.empty());


    return 0;
}

Output:

Length 1
1 occurred 3 / 3 times.
3 occurred 1 / 1 times.
5 occurred 2 / 2 times.
A occurred 2 / 2 times.
C occurred 2 / 1 times.
D occurred 2 / 3 times.
E occurred 2 / 2 times.
G occurred 1 / 2 times.
Length 2
1D occurred 1 / 1 times.
1E occurred 1 / 1 times.
DE occurred 1 / 1 times.
E5 occurred 1 / 1 times.
EG occurred 1 / 1 times.
G5 occurred 1 / 1 times.
Length 3
1EG occurred 1 / 1 times.
DE5 occurred 1 / 1 times.

The most amount of memory will be used during initialization because there will be an entry for every character of both input strings. If you know the approximate length of the strings, you can choose a more appropriate index data type than size_t. The amount of memory needed is in the order of the input size. So two 100 MB files should be no problem for common computers. After the initialization (more specifically, after the first iteration of the loop), most of these data will be deleted because it is not needed any more.

like image 85
Nico Schertler Avatar answered Oct 17 '22 07:10

Nico Schertler


Here's a C implementation based on traversing the suffix array of the concatenation of the inputs, with the help of the longest common prefix array. You can replace the programming-contest-grade (O(n log^2 n)) suffix array implementation with a real one (O(n) or O(n log n)) for a large performance improvement. (EDIT: did this, with some other changes reflecting the asker's new requirements: https://github.com/eisenstatdavid/commonsub .)

#include <inttypes.h>
#include <limits.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

typedef int_fast32_t I32;

#define Constrain(expression) _Static_assert(expression, #expression)
Constrain(CHAR_BIT == 8);
#define InputMaxBytes 80000000
Constrain(InputMaxBytes <= (INT_LEAST32_MAX - 2) / 2);
#define MaxLen (2 * InputMaxBytes + 2)
Constrain(MaxLen <= INT_FAST32_MAX / 2);

static I32 Len;
static I32 Begin2;
static signed char Buf[MaxLen];
static int_least32_t SufArr[MaxLen];
static int_least32_t SufRank[MaxLen];
static int_least32_t NewRank[MaxLen];
static int_least32_t *const LongCommPre = NewRank;  // aliased to save space
static uint_least64_t Bitmap2[(MaxLen >> 6) + 1];
static int_least32_t SparseCount2[(MaxLen >> 6) + 1];
static int_least32_t *const Stack = SufRank;  // aliased to save space

static void Slurp(const char *filename) {
  FILE *stream = fopen(filename, "r");
  if (stream == NULL) goto fail;
  I32 n = fread(Buf + Len, sizeof *Buf, InputMaxBytes + 1, stream);
  if (ferror(stream)) goto fail;
  if (n > InputMaxBytes) {
    fprintf(stderr, "%s: file is too large; increase InputMaxBytes\n",
            filename);
    exit(EXIT_FAILURE);
  }
  for (I32 i = 0; i < n; i++) {
    if (Buf[Len + i] < 0) {
      fprintf(stderr,
              "%s: file contains non-ASCII byte at offset %" PRIdFAST32 "\n",
              filename, i);
      exit(EXIT_FAILURE);
    }
  }
  Len += n;
  if (fclose(stream) == EOF) goto fail;
  return;
fail:
  perror(filename);
  exit(EXIT_FAILURE);
}

static I32 Radix;

static int CompareRankPairs(const void *iPtr, const void *jPtr) {
  I32 i = *(const int_least32_t *)iPtr;
  I32 j = *(const int_least32_t *)jPtr;
  if (SufRank[i] < SufRank[j]) return -1;
  if (SufRank[i] > SufRank[j]) return 1;
  I32 iRank = i + Radix < Len ? SufRank[i + Radix] : -2;
  I32 jRank = j + Radix < Len ? SufRank[j + Radix] : -2;
  if (iRank < jRank) return -1;
  if (iRank > jRank) return 1;
  return 0;
}

static void BuildSuffixArray(void) {
  for (I32 i = 0; i < Len; i++) {
    SufArr[i] = i;
    SufRank[i] = Buf[i];
  }
  for (Radix = 1; true; Radix *= 2) {
    qsort(SufArr, Len, sizeof *SufArr, CompareRankPairs);
    NewRank[0] = 0;
    for (I32 i = 1; i < Len; i++) {
      NewRank[i] = CompareRankPairs(&SufArr[i - 1], &SufArr[i]) == 0
                       ? NewRank[i - 1]
                       : NewRank[i - 1] + 1;
    }
    for (I32 i = 0; i < Len; i++) {
      SufRank[SufArr[i]] = NewRank[i];
    }
    if (NewRank[Len - 1] == Len - 1) break;
  }

  I32 lenCommPre = 0;
  for (I32 i = 0; i < Len; i++) {
    if (SufRank[i] == Len - 1) {
      LongCommPre[SufRank[i]] = -1;
      continue;
    }
    while (Buf[i + lenCommPre] == Buf[SufArr[SufRank[i] + 1] + lenCommPre]) {
      lenCommPre++;
    }
    LongCommPre[SufRank[i]] = lenCommPre;
    if (lenCommPre > 0) lenCommPre--;
  }
}

static I32 PopCount(uint_fast64_t x) {
  I32 v = 0;
  while (x != 0) {
    x &= x - 1;
    v++;
  }
  return v;
}

static void BuildCumCount2(void) {
  for (I32 i = 0; i < Len; i++) {
    if (SufArr[i] >= Begin2) {
      Bitmap2[i >> 6] |= UINT64_C(1) << (i & 63);
      SparseCount2[i >> 6]++;
    }
  }
  for (I32 i = 0; i < (Len >> 6); i++) {
    SparseCount2[i + 1] += SparseCount2[i];
  }
}

static I32 CumCount2(I32 i) {
  return SparseCount2[i >> 6] - PopCount(Bitmap2[i >> 6] >> (i & 63));
}

static void FindCommonStrings(void) {
  I32 lenCommPre = -1;
  for (I32 i = 0; i < Len; i++) {
    while (lenCommPre > LongCommPre[i]) {
      I32 begin = Stack[lenCommPre];
      I32 end = i + 1;
      I32 count2 = CumCount2(end) - CumCount2(begin);
      if (count2 > 0 && count2 < end - begin && lenCommPre > 0) {
        printf("%" PRIdFAST32 "\t%.*s\n", count2, (int)lenCommPre,
               Buf + SufArr[begin]);
      }
      lenCommPre--;
    }
    while (lenCommPre < LongCommPre[i]) {
      lenCommPre++;
      Stack[lenCommPre] = i;
    }
  }
}

int main(int argc, char *argv[]) {
  if (argc != 3) {
    fputs("usage: commonsub needle haystack\n", stderr);
    exit(EXIT_FAILURE);
  }
  Len = 0;
  Slurp(argv[1]);
  Buf[Len] = -1;
  Len++;
  Begin2 = Len;
  Slurp(argv[2]);
  Buf[Len] = -2;  // sentinel
  BuildSuffixArray();
  if (false) {
    for (I32 i = 0; i < Len; i++) {
      printf("%" PRIdFAST32 "\t%" PRIdLEAST32 "\t%" PRIdLEAST32 "\t%.*s\n", i,
             SufArr[i], LongCommPre[i], (int)(Len - SufArr[i]),
             Buf + SufArr[i]);
    }
  }
  BuildCumCount2();
  FindCommonStrings();
}
like image 5
David Eisenstat Avatar answered Oct 17 '22 09:10

David Eisenstat


After looking at the two strings and thinking about this for a bit I've done this procedure in my head and now I'm going to translate it into steps.

String 1: ABCDE512ABC361EG51D  // S1
String 2: ADE5AHDW4131EG1DG5C  // S2

When I was thinking about this we can compare characters and or substrings from S1 to S2 while keeping track of occurrences.

S1[0] = 'A'  compare S2[0]  = 'A' = true : found A in S2 at location 0
S1[0] = 'A'  compare S2[1]  = 'D' = false
S1[0] = 'A'  compare S2[2]  = 'E' = false
S1[0] = 'A'  compare S2[3]  = '5' = false
S1[0] = 'A'  compare S2[4]  = 'A' = true : found A in S2 at location 4
S1[0] = 'A'  compare S2[5]  = 'H' = false
S1[0] = 'A'  compare S2[6]  = 'D' = false
S1[0] = 'A'  compare S2[7]  = 'W' = false
S1[0] = 'A'  compare S2[8]  = '4' = false
S1[0] = 'A'  compare S2[9]  = '1' = false
S1[0] = 'A'  compare S2[10] = '3' = false
S1[0] = 'A'  compare S2[11] = '1' = false; 
S1[0] = 'A'  compare S2[12] = 'E' = false; 
S1[0] = 'A'  compare S2[13] = 'G' = false;
S1[0] = 'A'  compare S2[14] = '1' = false;
S1[0] = 'A'  compare S2[15] = 'D' = false;
S1[0] = 'A'  compare S2[16] = 'G' = false;
S1[0] = 'A'  compare S2[17] = '5' = false;
S1[0] = 'A'  compare S2[18] = 'C' = false;

// End of First Search - Occurrences of 'A' in S2 is 2 at locations {0,4}

// Next Iteration
String 1: ABCDE512ABC361EG51D  // S1
String 2: ADE5AHDW4131EG1DG5C  // S2

// Repeat this for all single characters Of S1 against S2
'A' in S2 = 2  at {0,4}
'B' in S2 = 0 
'C' in S2 = 1  at {18}
'D' in S2 = 3  at {1,6,15}
'E' in S2 = 2  at {2,12}
'5' in S2 = 2  at {3,17}
'1' in S2 = 3  at {9,11,14}
'2' in S2 = 0
'A' Already Found Above Skip
'B' Already Found Above Skip
'C' Already Found Above Skip
'3' in S2 = 1  at {10}
'6' in S2 = 0
'1' Already Found Above Skip
'E' Already Found Above Skip
'G' in S2 = 2  at {13, 16}
'5' Already Found Above Skip
'1' Already Found Above Skip
'D' Already Found Above Skip

This would conclude the first set of iterations for doing all single characters and as you can see we also built a list and a map or sets of not only occurrences but also their locations and we can store them for future references. So if we begin to search for S1[0 & 1] in S2 we know that S1[1] does not exist in S2 so we can break and don't need to go down that chain and since we can break out of that branch we can also skip over doing S1[1 & ...N] and move directly to S1[2] and we know that there is only 1 occurrence of S1[2] which is 'C' in S2 located at {18} which is the end of the string so there is no need to look for S1[2 & ... N] so we can skip over this and move to S1[3] which is 'D' and we know that it does exist in S2 at {1,6,15} so now we can begin our search of S1[3 & ... N] beginning with S2[1 & ... N] then again do the same search of S1[3 & ... N] starting at S2[6 & ... N] and finally again starting S2[15 & ...N] then we have now found all sub strings that start with D in S2 and we can save their occurrences; however this is were we do want to find the longest substring between the two. The longest sub string is "DE5" and there is only one occurrence of it, but from this we have also already found the sub strings "DE" & "E5" so we can search for them at this point as well and we then find that there are 1 occurrence of each. And we just repeat this process. It will take sort of a long time at first, but the more you traverse through the strings, the faster it will work because of eliminating already found occurrences as well as skipping over non found sub strings of S1 in S2.

This is the logical approach that I took without using any code or programming semantics for it is just the basic algorithm of doing this logically. It now becomes a matter of determination to put this into functions and containers to write a source code implementation of it.

EDIT - As asked in the comments about the difference of this versus another's answer and with the time & space complexity here is a version of my algorithm doing the first pass searching for single characters and creating the tables of positions and if they exist in the 2nd string. The stored vector in the class contains each unique character in S1 within S2. This can then be used to help find longer substrings.

// C++ - The user asked for this in C but I haven't used C in nearly 10 years so this is my version of it in C++ :( 
#include <string>
#include <vector>

class SubStringSearch {
private:
    std::string S1;
    std::string S2; 

    struct SubstringResult {
        std::string substring;
        bool found;
        std::vector<unsigned> positions;

        SubstringResult(){}
        SubstringResult( const std::string& substringIn, bool foundIn, std::vector<unsigned> positionsIn ) :
            substring( substringIn ), found( foundIn ), positions( positionsIn ) {}
    };

    std::vector<SubstringResult> results;

public:
    SubStringSearch( const std::string& s1, const std::string& s2 ) : S1( s1 ), S2( s2 ) {}

    void compareStringsFirstPass();
    std::vector<unsigned> findLocations( const std::string& str, char findIt );
    void printResults() const;

};

std::vector<unsigned> SubStringSearch::findLocations( const std::string& str, char findIt ) {
    std::vector<unsigned> locations;
    for ( unsigned i = 0; i < str.size(); ++i ) {
        if ( str[i] == findIt ) {
            locations.push_back( i );
        }
    }
    return locations;
}

void SubStringSearch::compareStringsFirstPass() {
    std::vector<unsigned> positions;
    std::string sub;
    bool alreadyFound = false;

    for ( unsigned idx = 0; idx < S1.size(); ++idx ) {
        sub = S1[idx];

        if ( idx > 0 ) {
            for ( unsigned u = 0; u < results.size(); ++u ) {
                if ( sub == results[u].substring ) {
                    alreadyFound = true;
                    break;
                }
            }
        }

        // Added An If Else Here To Reduce Unneeded Calls To findLocations()
        if ( alreadyFound ) {
            alreadyFound = false;
            continue;
        } else {
            positions = findLocations( S2, S1[idx] );
        }

        if ( positions.size() > 0 && !alreadyFound ) {
            results.push_back( SubstringResult( sub, true, positions ) );
        } else if ( !alreadyFound ) {
            positions.clear();
            results.push_back( SubstringResult( sub, false, positions ) );
        }

        positions.clear();
        alreadyFound = false;
    }
}

void SubStringSearch::printResults() const {
    for ( unsigned u = 0; u < results.size(); ++u ) {
        if ( results[u].found ) {
            std::cout << results[u].substring << " found in S2 at " << std::setw(2);
            for ( unsigned i = 0; i < results[u].positions.size(); ++i ) {
                std::cout << std::setw(2) << results[u].positions[i] << " ";
            }
            std::cout << std::endl;
        }
    }
}

int main() {
    std::string S1( "ABCDE512ABC361EG51D" );
    std::string S2( "ADE5AHDW4131EG1DG5C" );

    SubStringSearch searchStrings( S1, S2 );
    searchStrings.compareStringsFirstPass();

    std::cout << "break point";

    return 0;
} // main

Place a break point on that last print line and go into your debugger for either your locals or your autos in MSVC or something equivalent for your version of your compiler / debugger and check out the contents of the class's member variable that is a std::vector and you will see the character from S1 and attached to it will be a bool flag if it is found or not as well as a std::vector for each of the positions. So if the flag is false then the vector size should be 0 and vise versa if the vector size is > 0 then the flag should be true; also the size of the vector of positions is also the count or the occurrences of that character in the 2nd string which makes this nice because we don't have to calculate anything else we can just get that from the vector itself.

Now this is not the complete or full algorithm as this is only the first pass of doing each single character of string 1 and looking into string 2 while building the needed table and skipping over contents that have already been found. It will be up to the OP to build upon this if they so choose to complete the rest of the algorithm. If I happen to find some free time in the near future I may go ahead and complete the full algorithm.

like image 1
Francis Cugler Avatar answered Oct 17 '22 08:10

Francis Cugler