Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Optimize a list of text additions and deletions

I've got a list containing positions of text additions and deletions, like this:

     Type   Position   Text/Length
1.   +      2          ab          // 'ab' was added at position 2
2.   +      1          cde         // 'cde' was added at position 1
3.   -      4          1           // a character was deleted at position 4

To make it more clear, this is what these operations will do:

    1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
    ---------------------------------
    t | e | x | t |   |   |   |   |  
1.  t | a | b | e | x | t |   |   |  
2.  c | d | e | t | a | b | e | x | t
3.  c | d | e | a | b | e | x | t |

The number of actions can be reduced to:

     Type   Position   Text/Length
1.   -      1          1           // 't' was deleted at position 1
2.   +      1          cdeab       // 'cdeab' was added at position 1

Or:

     Type   Position   Text/Length
1.   +      1          cdeab       // 'cdeab' was added at position 1
2.   -      6          1           // 't' was deleted at position 6

These actions are to be saved in my database and in order to optimize this: how can I reduce the number of actions that are to be done to get the same result? Is there a faster way than O(n*n)?

Note that these actions are chronological, changing the order of the actions will give another result.

like image 737
Harmen Avatar asked Jan 16 '10 14:01

Harmen


2 Answers

Not a solution, just some thoughts:

  • Rule 1: if two consecutive operations don't have overlapping ranges, they can be swapped (with positions adjusted)
  • Rule 2: two consecutive inserts or removals at the same position can be joined
  • Rule 3: when an insert is followed by a removal that is completely contained in the insert, they can be joined

I don't see a straightforward algorithm for the shortest solution. However, an heuristic approach using Rule 1 + 2 might be:

  • move operations "up" unless
    • you'd violate Rule 1
    • you'd move an insert before a removal
    • the position is less than that of that predecessor
  • join consecutive inserts/removals at the same position

Applied to the sample, this would mean:

 + 2 ab
 + 1 cde
 - 4 1

Rule 1 (2x):

+ 2 ab
- 1 1   // position adjusted by -3
+ 1 cde

.

- 1 1  
+ 1 ab  // position adjusted
+ 1 cde

Rule 2:

- 1 1
+ 1 cdeab // watch correct order!

A primitive implementation will be O(N*N) - basically, a bubble sort with additonal stopping conditions. I'm not sure about beating down that complexity, since standard algorithms are of no use here due to having to adjust the position.

However, you might be able to improve things notably - e.g. you don't need a "full sort"

like image 144
peterchen Avatar answered Sep 22 '22 02:09

peterchen


Make a binary tree representing the document before and after applying all the changes. Each node represents either original text or inserted/deleted text; the latter kind of node includes both the amount of original text to delete (possibly 0) and the string of text to insert (possibly empty).

Initially the tree has just one node, "0 to end: original text". Apply all the changes to it merging changes as you go wherever possible. Then walk the tree from beginning to end emitting the final set of edits. This is guaranteed to produce the optimal result.

  • Applying an insert: Find the appropriate point in the tree. If it's in the middle of or adjacent to inserted text, just change that node's text-to-insert string. Otherwise add a node.

  • Applying a delete: Find the starting and ending points in the tree—unlike an insert, a delete may cover a whole range of existing nodes. Modify the starting and ending nodes accordingly, and kill all the nodes in between. After you're done, check to see if you have adjacent "inserted/deleted text" nodes. If so, join them.

The only tricky bit is making sure you can find points in the tree, without updating the entire tree every time you make a change. This is done by caching, at each node, the total amount of text represented by that subtree. Then when you make a change, you only have to update these cached values on nodes directly above the nodes you changed.

This looks strictly O(n log n) to me for all input, if you bother to implement a balanced tree and use ropes for the inserted text. If you ditch the whole tree idea and use vectors and strings, it's O(n2) but might work fine in practice.

Worked example. Here is how this algorithm would apply to your example, step by step. Instead of doing complicated ascii art, I'll turn the tree on its side, show the nodes in order, and show the tree structure by indentation. I hope it's clear.

Initial state:

*: orig

I said above we would cache the amount of text in each subtree. Here I just put a * for the number of bytes because this node contains the whole document, and we don't know how long that is. You could use any large-enough number, say 0x4000000000000000L.

After inserting "ab" at position 2:

    2: orig, 2 bytes
*: insert "ab", delete nothing
    *: orig, all the rest

After inserting "cde" at position 1:

        1: orig, 1 byte
    5: insert "cde", delete nothing
        1: orig, 1 byte
*: insert "ab", delete nothing
    *: orig, all the rest

The next step is to delete a character at position 4. Pause here to see how we find position 4 in the tree.

Start at the root. Look at the first child node: that subtree contains 5 characters. So position 4 must be in there. Move to that node. Look at its first child node. This time it contains only 1 character. Not in there. This edit contains 3 characters, so it's not in here either; it's immediately after. Move to the second child node. (This algorithm is about 12 lines of code.)

After deleting 1 character at position 4, you get this...

    4: orig, 1 byte
        3: insert "cde", delete nothing
*: insert "ab", delete nothing
    *: orig, all the rest

...and then, noticing two adjacent insert nodes, you merge them. (Note that given two adjacent nodes, one is always somewhere above the other in the tree hierarchy. Merge the data into that higher node; then delete the lower one and update the cached subtree sizes in between.)

    1: orig, 1 byte
*: insert "cdeab", delete nothing
    *: orig, all the rest
like image 24
Jason Orendorff Avatar answered Sep 22 '22 02:09

Jason Orendorff