Say I have one large string and an array of substrings that when joined equal the large string (with small differences).
For example (note the subtle differences between the strings):
large_str = "hello, this is a long string, that may be made up of multiple
substrings that approximately match the original string"
sub_strs = ["hello, ths is a lng strin", ", that ay be mad up of multiple",
"subsrings tat aproimately ", "match the orginal strng"]
How can I best align the strings to produce a new set of sub strings from the original large_str
? For example:
["hello, this is a long string", ", that may be made up of multiple",
"substrings that approximately ", "match the original string"]
Additional Info
The use case for this is to find the page breaks of the original text from the existing page breaks of text extracted from a PDF document. Text extracted from the PDF is OCR'd and has small errors compared to the original text, but the original text does not have page breaks. The goal is to accurately page break the original text avoiding the OCR errors of the PDF text.
Given a string, find the count of distinct subsequences of it. The problem of counting distinct subsequences is easy if all characters of input string are distinct. The count is equal to nC0 + nC1 + nC2 + … nCn = 2n.
A subsequence of a string is a sequence that can be derived from the given string by deleting zero or more elements without changing the order of the remaining elements.
A subsequence is derived from the original sequence by removing elements, keeping the order. One can remove finite or infinite many elements. Valid edge cases are: remove all elements of the original sequence.
A Substring takes out characters from a string placed between two specified indices in a continuous order. On the other hand, subsequence can be derived from another sequence by deleting some or none of the elements in between but always maintaining the relative order of elements in the original sequence.
An implementation using Python's difflib:
from difflib import SequenceMatcher
from itertools import accumulate
large_str = "hello, this is a long string, that may be made up of multiple substrings that approximately match the original string"
sub_strs = [
"hello, ths is a lng strin",
", that ay be mad up of multiple",
"subsrings tat aproimately ",
"match the orginal strng"]
sub_str_boundaries = list(accumulate(len(s) for s in sub_strs))
sequence_matcher = SequenceMatcher(None, large_str, ''.join(sub_strs), autojunk = False)
match_index = 0
matches = [''] * len(sub_strs)
for tag, i1, i2, j1, j2 in sequence_matcher.get_opcodes():
if tag == 'delete' or tag == 'insert' or tag == 'replace':
matches[match_index] += large_str[i1:i2]
while j1 < j2:
submatch_len = min(sub_str_boundaries[match_index], j2) - j1
while submatch_len == 0:
match_index += 1
submatch_len = min(sub_str_boundaries[match_index], j2) - j1
j1 += submatch_len
else:
while j1 < j2:
submatch_len = min(sub_str_boundaries[match_index], j2) - j1
while submatch_len == 0:
match_index += 1
submatch_len = min(sub_str_boundaries[match_index], j2) - j1
matches[match_index] += large_str[i1:i1+submatch_len]
j1 += submatch_len
i1 += submatch_len
print(matches)
Output:
['hello, this is a long string',
', that may be made up of multiple ',
'substrings that approximately ',
'match the original string']
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With