Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Context sensitive diff implementation

Summary and basic question

Using MS Access 2010 and VBA (sigh..)

I am attempting to implement a specialized Diff function that is capable of outputting a list of changes in different ways depending on what has changed. I need to be able to generate a concise list of changes to submit for our records.

I would like to use something such as html tags like <span class="references">These are references 1, 6</span> so that I can review the changes with code and customize how the change text is outputted. Or anything else to accomplish my task.

I see this as a way to provide an extensible way to customize the output, and possibly move things into a more robust platform and actually use html/css.

Does anyone know of a similar project that may be able to point me in the right direction?

My task

I have an access database with tables of work operation instructions - typically 200-300 operations, many of which are changing from one revision to another. I have currently implemented a function that iterates through tables, finds instructions that have changed and compares them.

Note that each operation instruction is typically a couple sentences with a couple lines at the end with some document references.

My algorithm is based on "An O(ND) Difference Algorithm and Its Variations" and it works great.

Access supports "Rich" text, which is just glorified simple html, so I can easily generate the full text with formatted additions and deletions, i.e. adding tags like <font color = "red"><strong><i>This text has been removed</i></strong></font>. The main output from the Diff procedure is a full text of the operation that includes non-changed, deleted, and inserted text inline with each other. The diff procedure adds <del> and <ins> tags that are later replaced with the formatting text later (The result is something similar to the view of changes from stack exchange edits).

However, like I said, I need the changes listed in human readable format. This has proven difficult because of the ambiguity many changes create.

for example: If a type of chemical is being changed from "Class A" to "Class C", the change text that is easily generated is "Change 'A' to 'C'", which is not very useful to someone reviewing the changes. More common are document reference at the end: Adding SOP 3 to the list such as "SOP 1, 2, 3" generates the text "Add '3'". Clearly not useful either.

What would be most useful is a custom output for text designated as "SOP" text so that the output would be "Add reference to SOP 3".

I started with the following solution:

Group words together, e.g. treat text such as "SOP 1, 2, 3" as one token to compare. This generates the text "Change 'SOP 1, 2' to 'SOP 1, 2, 3". This get's cluttered when there is a large list and you are attempting to determine what actually changed.

Where I am now

I am now attempting to add extra html tags before running the the diff algorithm. For example, I will run the text through a "pre-processor" that will convert "SOP 1, 2" to SOP 1, 2

Once the Diff procedure returns the full change text, I scan through it noting the current "class" of text and when there is a <del> or <ins> I capture the text between the tags and use a SELECT CASE block over the class to address each change.

This actually works okay for the most part, but there are many issues that I have to work through, such add Diff deciding that the shortest path is to delete certain opening tags and insert other ones. This creates a scenario that there are two <span> tags but only one </span> tag.

The ultimate question

I am looking for advise to either continue with the direction I have started or to try something different before investing a lot more time into a sub-optimal solution.

Thanks all in advance.

Also note:

The time for a typical run is approximately 1.5 to 2.5 seconds with me attempting more fancy things and a bunch of debug.prints. So running through an extra pass or two wouldn't be killer.

like image 726
ptpaterson Avatar asked Dec 23 '13 21:12

ptpaterson


3 Answers

It is clear that reporting a difference in terms of the smallest change to the structures you have isn't what you want; you want to report some context.

To do that, you have to identify what context there is to report, so you can decide what part of that is interesting. You sketched an idea where you fused certain elements of your structure together (e.g., 'SOP' '1' '2' into 'SOP 1 2'), but that seems to me to be going the wrong way. What it is doing is changing the size of the smallest structure elements, not reporting better context.

While I'm not sure this is the right approach, I'd try to characterize your structures using a grammar, e.g., a BNF. For instance, some grammar rules you might have would be:

 action = 'SOP' sop_item_list ;
 sop_item_list = NATURAL ;
 sop_item_list = sop_item_list ',' NATURAL ;

Now an actual SOP item can be characterized an an abstract syntax tree (show nested children, indexable by constants to get to subtrees):

 t1: action('SOP',sop_item_list(sop_item_list(NATURAL[1]),',',NATURAL[2]))

You still want to compute a difference using something like the dynamic programming algorithm you have suggested, but now you want a minimal tree delta. Done right (my company makes tools that do this for grammars for conventional computer languages, and you can find publicly available tree diff algorithms), you can get a delta like:

  replace t1[2] with op_item_list(sop_item_list(sop_item_list(NATURAL[1]),',',NATURAL[2]),',',NATURAL[3]

which in essence is what you got by gluing (SOP,1,2) into a single item, but without the external adhocery of you personally deciding to do that.

The real value in this, I think, it that you can use the tree t1 to report the context. In particular, you start at the root of the tree, and print summaries of the subtrees (clearly you don't want to print full subtrees, as that would just give you back the full original text). By printing subtrees down to a depth of 1 or two levels and eliding anything deep (e.g, representing a list as "..." and single subtree as "_"), you could print something like:

replace 'SOP' ... with 'SOP',...,3

which I think is what you are looking for in your specific example.

No, this isn't an algorithm; it is a sketch of an idea. The fact we have tree-delta algorithms that compute useful deltas, and the summarizing idea (taken from LISP debuggers, frankly) suggests this will probably generalize to something useful, or a least take you in a new direction.

Having your answer in terms of ASTs, should also make it relatively easy to produce HTML as you desire. (People work with XML all the time, and XML is basically a tree representation).

like image 166
Ira Baxter Avatar answered Sep 19 '22 12:09

Ira Baxter


Try thinking in terms of Prolog-style rewrite rules that transform your instructions into a canonical form that will cause the diff algorithm to produce what you need. The problem you specified would be solved by this rule:

SOP i1, i2, ... iN  ->  SOP j1, SOP j2, ... SOP jN  where j = sorted(i)

In other words, "distribute" SOP over a sorted list of the following integers. This will trick the diff algorithm into giving a fully qualified change report "Add SOP 3."

Rules are applied by searching the input for matches of the left hand side and replacing with the corresponding right.

You are probably already doing this, but you will get more commonsense analysis if the input is tokenized: "SOP" should considered a single "character" for the diff algorithm. Whitespace may be reduced to tokens for space and line break if they are significant or else ignored.

You can do another kind of diff at a the character level to test "fuzzy" equality of tokens to account for typographical errors when matching left-hand-sides. "SIP" and "SOP" would be counted a "match" because their edit distance is only 1 (and I and O are only one key apart on a QUERTY keyboard!).

If you consider all the quirks in the output you're getting now and can rectify each one as a rewrite rule that takes the input to a form where the diff algorithm produces what you need, then what's left is to implement the rewrite system. Doing this in a general way that is efficient so that changing and adding rules does not involve a great deal of ad hoc coding is a difficult problem, but one that has been studied. It's interesting that @Ira Baxter mentioned lisp, as it excels as a tool for this sort of thing.

Ira's suggestion of syntax trees falls naturally into the rewrite rule method. For example, suppose SOPs have sections and paragraphs:

SOP 1 section 3, paras 2,1,3

is a hierarchy that should be rewritten as

SOP 1 section 3 para 1, SOP 1 section 3 para 2, SOP 1 section 3 para 3

The rewrite rules

paras i1, i2, ... iN  ->  para j1, para j2, ... para jN  where j = sorted(i)
section K para i1, ... para iN  ->s section K para j1, ... section K para j1
SOP section K para i1, ... section K para i1 -> SOP section K para j1, ... SOP section K para j1 

when applied in three passes will produce a result like "SOP 1 section 3, para 4 was added."

While there are many strategies to implementing the rules and rewrites, including coding each one as a procedure in VB (argh...), there are other ways. Prolog is a grand attempt to do this as generally as possible. There is a free implementation. There are others. I have used TXL to get useful rewriting done. The only problem with TXL is that it assumes you have a rather strict grammar for inputs, which it doesn't sound like the case in your problem.

If you post more examples of the quirks in your current outputs, I can follow this up with more detail on rewrite rules.

like image 35
Gene Avatar answered Sep 22 '22 12:09

Gene


In case you would decide to proceed with what you already achieved (IMO you got pretty far already), you might consider doing two steps of diff.

Group words together, e.g. treat text such as "SOP 1, 2, 3" as one token to compare.

That's a good start; you already managed to make the context clear to the user.

This generates the text "Change 'SOP 1, 2' to 'SOP 1, 2, 3'". This get's cluttered when there is a large list and you are attempting to determine what actually changed.

How about doing another diff pass across the tokens found (i.e. compare 'SOP 1, 2' with 'SOP 1, 2, 3'), this time without the word grouping, to generate additional info? That would make the full output something like this:

Change 'SOP 1, 2' to 'SOP 1, 2, 3'

Change details: Add '3'

The text is a bit cryptic, so you may want to do some refining there. I would also suggest to truncate lengthy tokens in the first line ('SOP 1, 2, 3, ...'), since the second line should already provide sufficient detail.

I am not sure about the performance impact of this second pass; in a big text with many changes, you may experience many roundtrips to the diff functionality. You might optimize by accumulating changes from the first pass into one 'change document', run the second pass across it, then stitch the results together.

HTH.

like image 20
Ruud Helderman Avatar answered Sep 21 '22 12:09

Ruud Helderman