Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How can I find the best fit subsequences of a large string?

Say I have one large string and an array of substrings that when joined equal the large string (with small differences).

For example (note the subtle differences between the strings):

large_str = "hello, this is a long string, that may be made up of multiple
 substrings that approximately match the original string"

sub_strs = ["hello, ths is a lng strin", ", that ay be mad up of multiple",
 "subsrings tat aproimately ", "match the orginal strng"]

How can I best align the strings to produce a new set of sub strings from the original large_str? For example:

["hello, this is a long string", ", that may be made up of multiple",
 "substrings that approximately ", "match the original string"]

Additional Info

The use case for this is to find the page breaks of the original text from the existing page breaks of text extracted from a PDF document. Text extracted from the PDF is OCR'd and has small errors compared to the original text, but the original text does not have page breaks. The goal is to accurately page break the original text avoiding the OCR errors of the PDF text.

like image 664
Josh Voigts Avatar asked Aug 31 '17 21:08

Josh Voigts


People also ask

How do you find the number of subsequences in a string?

Given a string, find the count of distinct subsequences of it. The problem of counting distinct subsequences is easy if all characters of input string are distinct. The count is equal to nC0 + nC1 + nC2 + … nCn = 2n.

What are subsequences of a string?

A subsequence of a string is a sequence that can be derived from the given string by deleting zero or more elements without changing the order of the remaining elements.

How do you find subsequence?

A subsequence is derived from the original sequence by removing elements, keeping the order. One can remove finite or infinite many elements. Valid edge cases are: remove all elements of the original sequence.

What is the difference between subsequence and substring?

A Substring takes out characters from a string placed between two specified indices in a continuous order. On the other hand, subsequence can be derived from another sequence by deleting some or none of the elements in between but always maintaining the relative order of elements in the original sequence.


1 Answers

  1. Concatenate the substrings
  2. Align the concatenation with the original string
  3. Keep track of which positions in the original string are aligned with the boundaries between the substrings
  4. Split the original string on the positions aligned with those boundaries

An implementation using Python's difflib:

from difflib import SequenceMatcher
from itertools import accumulate

large_str = "hello, this is a long string, that may be made up of multiple substrings that approximately match the original string"

sub_strs = [
  "hello, ths is a lng strin",
  ", that ay be mad up of multiple",
  "subsrings tat aproimately ",
  "match the orginal strng"]

sub_str_boundaries = list(accumulate(len(s) for s in sub_strs))

sequence_matcher = SequenceMatcher(None, large_str, ''.join(sub_strs), autojunk = False)

match_index = 0
matches = [''] * len(sub_strs)

for tag, i1, i2, j1, j2 in sequence_matcher.get_opcodes():
  if tag == 'delete' or tag == 'insert' or tag == 'replace':
    matches[match_index] += large_str[i1:i2]
    while j1 < j2:
      submatch_len = min(sub_str_boundaries[match_index], j2) - j1
      while submatch_len == 0:
        match_index += 1
        submatch_len = min(sub_str_boundaries[match_index], j2) - j1
      j1 += submatch_len
  else:
    while j1 < j2:
      submatch_len = min(sub_str_boundaries[match_index], j2) - j1
      while submatch_len == 0:
        match_index += 1
        submatch_len = min(sub_str_boundaries[match_index], j2) - j1
      matches[match_index] += large_str[i1:i1+submatch_len]
      j1 += submatch_len
      i1 += submatch_len

print(matches)

Output:

['hello, this is a long string', 
 ', that may be made up of multiple ', 
 'substrings that approximately ', 
 'match the original string']
like image 85
Anton Avatar answered Oct 12 '22 19:10

Anton